Foreword


It has often been said that the science of psychology suffers from physics envy, a facetious
reference to the fact that most of its journals and training programs have assigned
priority—indeed, often exclusive validity—to findings obtained by means of the prac- 
tices of older disciplines focused on the study of inanimate matter, such as physics. Data 
obtained from strictly controlled laboratory settings, where the effect of single variables 
can be observed, are considered to be the acme of psychological research. 

In my opinion, the ambivalence implicit in the tag is justified. The eagerness with 
which psychology has adopted the procedures and rituals of the more established sci- 
ences is reminiscent of a young girl preening herself in front of a mirror all dressed in 
her mother’s clothes—high heels and flowing scarves included. We forget too easily that 
physics, chemistry, and biology developed slowly over time, starting first as natural sci- 
ences based on observation, classification, and description. When Galileo started the first 
“scientific” experiments over 400 years ago, he was working with concepts that philoso- 
phers and protoscientists had found relevant to describe phenomena in the natural world. 
For centuries before, dimensions of reality such as time, distance, velocity, and mass had 
been studied, and to a certain extent measured. So the results of Galileo’s investigations 
made sense to individuals interested in such things, and their relevance was seen as con- 
vincing. In psychology, such a consensus was lacking when, over 100 years ago, Wilhelm 
Wundt performed the first psychomotor experiments in Leipzig—and in fact, it could be 
said that agreement as to what is worth measuring about the psyche, and how it should 
be measured, has become even weaker since Wundt’s time. Some psychologists privilege 
qualitative descriptions of subjective states, while others only take seriously the results of 
brain scans or chemical analyses of hormonal changes. Lacking a consensus as to what 
really counts as knowledge, much of the experimental sophistication can only speak to 
local interests and remains unpersuasive to everyone else. 

The lack of agreement as to what is worth studying about the human psyche is,
of course, a consequence of the difference between what psychology studies and what
all the other sciences study. Physicists, chemists, or even biologists need not be concerned
about understanding what they are studying; they only need to know how matter, or liv- 
ing organisms, work—at least as seen from the perspective of the human observers. No 
one expects a chemist to question the experience of being a water molecule, nor a physi- 
cist to worry what an atom feels when it is split. This is presumably because molecules 
and atoms are not conscious systems. In psychology, however, it is essential to take into 
account the level of complexity of the human organism, with its self-reflective, conscious
potential; to fail to understand the subjective experience of people means to miss what is 
most unique, and perhaps most important, about them. 

To achieve such understanding, the experimental method has some obvious limita- 
tions. By its very nature, it forces the “subject” to act within a severely limited envi- 
ronment, and to comply with rules established by the experimenter. The possibility to 
express autonomy, self-determination, creativity, interest, or choice—or any of the other 
most interesting capacities that evolution has bestowed on human beings, and that our 
ancestors have slowly developed over tens of thousands of years—is incompatible with 
use of the experimental method. Heisenberg’s principle of indeterminacy works at the 
human level even more conspicuously than on the subatomic scale: The more rigorously 
the experimenter tries to measure a slice of behavior, the less accurate—or relevant—the 
account of the system being measured becomes. 

Inherent in the experimental method is the reduction of variables to those that can be 
manipulated in the laboratory. This allows for analytic precision but often yields results 
that have low ecological validity. For example, the well-known series of experiments by 
Stanley Milgram taught us a great deal about individual behavior. When unsuspecting 
subjects in his laboratory were asked by the experimenter to administer what looked 
like painful electric shocks to confederates mimicking acute discomfort, a great major- 
ity of subjects went on throwing the switch regardless of the pain they thought they 
were inflicting. The finding suggested that when confronted with requests by authority 
figures—in this case, the white-coated experimenter—ordinary citizens will obey even if 
it means causing mortal harm to another human being. This is certainly an important 
fact to know. But to extrapolate from these findings to behavior outside the psycholo- 
gist’s laboratory is quite unjustified. There are too many empirical examples from actual 
historical events that show how average people can be courageous, even heroic, when 
circumstances provide degrees of freedom that are usually present in real life but not in 
the laboratory, where they would be considered confounding variables. 

This volume presents a variety of new approaches that, while preserving what is best 
about the scientific method, avoid many of the pitfalls of the laboratory experimentation
that has been so successful in the physical sciences. In my remarks, I focus on the
experience sampling method (ESM), with which I am most familiar, but the issues apply 
equally to all those studies that use systematic measurements of momentary behavior and 
experience collected in situ. 

The ESM was developed in the early 1970s as an attempt to adapt the scientific 
method to the understanding of human behavior and human experience—instead of the 
reverse; that is, to understanding only those aspects of human experience that were ame- 
nable to experimental laboratory investigation. This volume shows how much has been 
accomplished in the past few decades, and it bears witness to the fact that it is possible 
to use a set of methods that approximates scientific rigor without ignoring what is most 
interesting and unique about being human. These methods include event-contingent sam- 
pling of experience, behavioral sampling, activity sampling, neuroendocrine sampling, 
and physiological sampling. As this volume shows, there are now complementary tools to 
the ESM that measure not just experiences but also bodily physiology outside the labora- 
tory in everyday life. 

Of course, there have been earlier attempts to develop such methodologies. The lead- 
ers of the French Revolution had an almost religious belief in science, and several scholars 
during the 19th century thought of applying systematic measures to the study of human
habits—using diaries to reconstruct the daily lives of individuals in various social classes 
and occupations. It is interesting that these techniques were ignored by psychologists 
but became part of the toolkit of sociologists. In the United States, Russian-born Pitirim 
Sorokin, who taught sociology at Harvard, started pioneering work on time budgets in 
the late 1930s. For instance, he had over 500 students keep diaries to investigate whether 
happy persons were more likely to be altruistic—a clear precursor of current research in 
positive psychology (Sorokin, 1950). 

But most sociological studies did not follow Sorokin’s lead and ended up being 
more in the category of time-use diaries—useful to determine what people did in their 
normal environment, but not very informative about their inner lives. The next wave 
of time-use studies, which included some reports of subjective experience as well, was
conducted at the University of Michigan, starting after World War II (e.g., Campbell, 
Converse, & Rodgers, 1976; Robinson, 1977). These, in turn, developed into the large- 
scale surveys that are currently canvassing on a global scale the time use, values, politi- 
cal opinions, and quality of life of entire populations (e.g., Gallup & Newport, 2007; 
Inglehart, 1978). 

Research methods for studying daily life, including ESM and its various permuta- 
tions, while partly inspired by these earlier studies, are in many ways a major improve- 
ment on them. One important development has been that instead of depending on a 
delayed reconstruction of what a person did and felt during the day, usually reported 
in the evening, the ESM relies on immediate descriptions of momentary events right as 
they are occurring, thereby avoiding the inevitable distortions caused by the passage of 
time. Another step has been the development of a more or less standard technique for a 
quantitative format that allows the measurement of the most salient dimensions of cogni- 
tive, affective, and motivational states at the moment the subject receives a signal. The 
psychometric properties of the ESM, including its reliability and validity, are presented in 
a recent volume by Hektner, Schmidt, and Csikszentmihalyi (2007). 

It is nice for psychologists to have a new set of methods to work with, but the real 
question is, what good are they? What can one learn about human psychology that can- 
not be learned by more conventional methods? Ways of obtaining knowledge are only 
means to an end, not to be evaluated on their own terms but only in terms of whether they 
actually help produce new knowledge. 

Trying to summarize all the findings obtained with the ESM and related methods 
would be too big a task for a foreword, so I limit myself to describing four results I found 
most surprising, then outline briefly what I think are the unique ways in which these 
results are adding to our understanding of human psychology. 


What Daily Life Methods Like ESM Say about Us 
The Social Nature of Humankind 


For me, perhaps the most important and unexpected result, confirmed by many of our 
studies and by those of colleagues elsewhere using these methods, has been the realization
of how difficult it is to be alone. Granted, some individuals relish solitude, but for
most people the world over, whether in Korea, Japan, Europe, or North America, being with
other people is almost always a better experience than being alone. 
People feel happier, more active, more creative when eating with someone else than
when eating alone; the same holds when working, watching television, or walking outdoors. The
only times people prefer being alone are when cleaning the house or reading a book. These
findings are true not only for teenagers but also for senior citizens and all ages in between; 
they are equally true of men as of women, and of both the highly educated and the barely 
educated. 


The Default Condition of Our Brain 


These findings are in turn connected to an even more interesting and important pattern 
in the data: Whenever alone—or more broadly, when not engaged in any goal-directed 
task—most people begin to lose control of their mind. Or to say it less dramatically, their 
thought processes begin to unravel. Their concentration diminishes, they are less able to 
focus their mind, they start to ruminate on things that are going wrong, and as a conse- 
quence their moods tend to become increasingly despondent. 

For instance, in a recent review of decades of findings, we concluded: “When left 
to itself, the mind turns to bad thoughts, trivial plans, sad memories and worries about 
the future. Entropy—disorder, confusion, decay—is the default option of consciousness” 
(Hektner et al., 2007, p. 279). That this finding is counterintuitive can be deduced from 
the fact that as late as 2010, an article published in Science heralded a great discovery: 
evidence that the mind is inherently unstable, and that its wandering causes psychic dis- 
tress (Killingsworth & Gilbert, 2010). 


Monks versus Monkeys 


Actually, over 1,000 years ago Eastern thinkers remarked on how difficult it is to con- 
trol the contents of consciousness; the Buddhists called our organ of thought the “mon- 
key mind” because it keeps jumping from one perception to another, from sensation to 
thought to feeling, like a monkey in the jungle that jumps from one branch to another. 

The important implication of what the Buddhist philosophers observed, and con- 
temporary psychologists using the ESM and related methods are confirming, is that to 
live a meaningful, coherent life we must learn to control attention—to impose a pattern 
on the capers of the monkey mind. This can be accomplished through either the spiritual 
discipline that monks have developed in all the world religions or by finding tasks that 
engage the mind in enjoyable and useful activities (e.g., Csikszentmihalyi & Nakamura, 
2010). 


The Nature of Work 


One of the important changes in perspective that daily life methods allowed concerns 
the perception of work. The popular wisdom about work—the one-third of life we spend 
exchanging our time and energy for fungible pay to exchange for most of the things we 
want to own and use—is that it is something we need to endure, but only against our will 
and better judgment. Daily life methods reveal a more complex story. While it is true that 
most workers would rather be home than at their jobs, they actually feel more focused, 
creative, strong, and active when they are at work than when at home. In fact, with the 
right conditions, work is the most fulfilling aspect of adult life. Given the fine resolution 
of the data it generates, the method of choice for finding out these “right conditions” is
use of intensive momentary sampling tools. 

If we consider schooling as a form of “work,” then these methods can help enor- 
mously to identify ways of presenting information, ways of teaching that young people 
will find attractive and stimulating. In this one area, despite many decades of experimen- 
tal work, very little headway has been made. The process of sustained learning is impos- 
sible to replicate in a laboratory. If we wish to know how and when students focus their 
attention on schoolwork—and especially when they do so effortlessly—the ESM and 
related methods are again our choice. 


The Experience of Pain and Illness 


Another surprising finding for me has come from the moment-by-moment responses of 
individuals in various conditions of sickness. Especially our colleagues in Italy, studying 
paraplegia and chronic illnesses, and in The Netherlands, studying schizophrenia, have 
shown that as long as patients are focused on demanding tasks, they are less likely to feel 
the pain of their condition (De Vries, 1992; Larson & Johnson, 1981; Peters et al., 2000).
Our results also confirm such findings: Even women with breast cancer report less pain as 
long as they are interacting with others, or working, and more pain when they are alone 
or have nothing to do. Also, teenagers who spend more time in demanding intellectual or 
physical activities report less pain overall in young adulthood. In this respect, the ESM 
and other methods for studying daily life provide some extremely important suggestions 
for medical intervention that I hope will become implemented in the future. 


The Three Meta-Findings 


If I had to point at the most intriguing theoretical implications of these new methods, I 
would choose the following three. 

First is confirmation of the basic psychological unity of humankind. In the past cen- 
tury, the concept of relativity has held a hegemonic position in the studies of our species. 
Everything was seen to depend on time and place, on cultural context, on social condi- 
tion. Good and bad had to be qualified by provenance. Yet people all over the world— 
from Nepal and Hong Kong to Seoul and Kyoto, from Milan and Lisbon to Frankfurt 
and Chicago—tend to spend their time in amazingly similar ways, and feel about what 
they do in very similar ways. 

Second is confirmation of what the Buddhists have been saying about the volatil- 
ity of consciousness, and the importance of being able to take control of it. In a similar 
vein, the neurophilosopher Thomas Metzinger (2004) has been pointing out that what 
we believe to be a coherent self is actually a bricolage of events in consciousness that do 
not necessarily fit with each other; there is no homunculus inside the head to direct our 
lives. His best-known book, Being No One, gives a good idea of what he means. The 
implications of this perspective alone are important enough to keep ESM-style research 
busy until the end of this new century, and well into the next. 

Third is realization of the importance of connectedness among people. Even in the 
United States, the most individualistic among the cultures of the world, to be with others, 
doing things together and helping each other, is among the best elements of life. Of course,
the ESM also shows important differences in the way people from supposedly individualistic
as opposed to collectivist cultures feel when they are with other people. Yet these
differences do not overshadow the basic fact: We all feel best in a social context. 


Do These Methods Solve All Our Problems? 


Of course not. Like any other methodology, the ESM and other daily life methods are 
best suited for some tasks, relatively good for many others, and unsuited for still other 
tasks. For example, what makes the ESM most valuable is also what makes it most 
vulnerable—namely, that it relies on personal accounts of subjective experience. If the 
respondents do not trust the research team, or if they want to create a false impression, 
ESM data become useless. It is therefore essential to develop a “research alliance” with 
respondents, to motivate them to be as transparent as possible. This is usually not too dif- 
ficult to achieve; and once the respondents are committed to help the study, they are not 
likely to provide misleading responses. But, of course, there are situations where respon- 
dents will be unusually suspicious: Workers in a factory may worry that their responses 
will fall into the hands of management and be turned against them; students might fear 
that the teacher will penalize them if she finds out how bored they are in class. And then 
there are situations where the ESM is too intrusive: For instance, business executives do 
not want the signal to interrupt their meetings, or lovers resent its interference during 
intimate moments, so at these critical times they turn the signaling device off. This vol- 
ume discusses these and several other limitations, as well as their solutions, in the use of 
the ESM and related methods. What is apparent is that today there are more choices than 
ever for selecting a research method that is most appropriate and practical for a research 
question and population. 

But despite inevitable shortcomings, what counts is that finally psychology has devel- 
oped serious scientific alternatives to laboratory experimentation; a set of methods that— 
in whatever final form they take—will provide unique insights into what it means to be a 
human being. This might sound presumptuous, but I hope that reading the chapters that 
follow will justify the claim. 


MIHALY CSIKSZENTMIHALYI, PhD 
School of Behavioral and Organizational Sciences 
Claremont Graduate University 
Preface


When we started conceiving this project, like most (first-time) editorial parents,
we had extremely high expectations. We wanted our handbook to be smart,
knowledgeable, and helpful, and, over time, prove to be worthwhile and competent. We 
thought that psychology and related fields needed a resource that brought together exist- 
ing expertise on the use of various methods for the intensive measurement of aspects of 
daily life. Although the names of these methods vary (e.g., experience sampling, diary 
methods, ambulatory assessment, ecological momentary assessment), at their core they 
allow researchers to study experience, behavior, environments, and physiology of indi- 
viduals in their natural settings, in (close to) real time, and on repeated occasions. While 
these methods are not necessarily difficult to employ, there is considerable insider knowl- 
edge with respect to study design and analysis that is important for researchers to access. 
Therefore, our goal was to draw together eminent researchers to document their practical 
expertise in a single book. Of course, high (parental) expectations can lead to disap- 
pointment. But in this case, the project far exceeded our keenest expectations. Although 
we would like to attribute its success to good parenting, it probably has more to do with 
factors beyond our control, such as the timing of the handbook and the incredible lineup 
of high-caliber researchers willing to contribute to the project. 


A Well-Timed Birth 


We believe the Handbook is well timed for several reasons. First, the field of psychology 
recently started seeing a renewed commitment to real-world behavioral research (Baumeis- 
ter, Vohs, & Funder, 2007; Cialdini, 2009; Furr, 2009; Patterson, 2008; Rozin, 2009), 
which seems appropriate as we come out of the American Psychological Association’s 
(APA) “decade of behavior” (2001-2010). Extrapolating from the fact that neuroimaging 
research accelerated following the National Institute of Mental Health’s “decade of the 
brain” (1990-1999), one might expect real-world behavioral research to surge following 
the APA’s decade of behavior. Thus, our goal with this handbook was to anticipate this 
surge and provide a comprehensive guide to research methods that enable researchers to 
study real-world behavior. 

The timing of the Handbook is also excellent because there is now a wider range of 
naturalistic sampling methods from which to choose. There are methods to measure self-reported
experiences and behaviors, physiology, hormones, audible expressions, physical
activity, posture, and environments. Thus, to be broad and inclusive in scope, the second 
section of the Handbook covers a range of these innovative methods, including ambula- 
tory physiological measurement, endocrinological assessment, and the sampling of other 
(non-self-reported) aspects of behaviors and environments. We believe that these novel 
ways of assessing objective (in the sense of traceable) aspects of daily life will prove to 
be valuable complements to daily life methods employing self-reports. We also strongly 
believe, though, that objective and subjective carry no connotation about measurement 
priorities; instead, they reflect two different yet complementary assessment perspectives 
(i.e., the one of the actor and the other of the observer). A comprehensive psychology of 
human behavior necessitates the dedicated and balanced study of both. At the same time, 
the Handbook is somewhat more focused on measurement of self-reported experiences 
and behaviors, reflecting their greater penetration in psychology and related fields. We 
expect that as these methods progress, future resources may be more equitable in their 
content. Thus, our aim was to provide a near-to-comprehensive guide to the range of 
methods available at this point in time. 


Interdisciplinary, International, and Practical Character 


We had three main goals for the Handbook. First, we wanted to “think interdiscipli- 
narily.” Although the Handbook is firmly grounded in the field of psychology, it was 
important to us that it reach out openly and broadly to neighboring fields also rooted in 
the study of daily life, such as behavioral medicine and psychiatry. At the same time, the 
Handbook is perhaps overly representative of social and personality psychology, reflect- 
ing our personal backgrounds as well as the distinctively social origins of many of these 
methods; however, as seen in the last section of the Handbook, applications in other 
major areas of psychology (health, clinical, developmental, industrial/organizational) are 
also reviewed substantively. 

Second, we wanted to “think internationally” and bridge continents. Our goal was 
to bring together contributors from a range of countries in which daily life research is rep- 
resented. As a result, our chapter authors come from nine countries and four continents: 
Austria, Canada, Germany, The Netherlands, Singapore, Switzerland, the United King- 
dom, the United States, and New Zealand. This international focus reinforces the goal of 
the Society for Ambulatory Assessment (SAA; www.ambulatory-assessment.org), which 
was founded in 2008 and grew out of the European Network of Ambulatory Assess- 
ment. Many contributors to the Handbook are part of the SAA—including the current 
president (Frank Wilhelm), all members of the Executive Committee, and many members 
of the Scientific Advisory Board. It is our hope that this handbook will work in tandem 
with the SAA to foster greater awareness of the large community of scholars dedicated to 
these methods, ultimately facilitating international collaborations. 

Finally, we sought to “think practically” to maximize usability of the Handbook. 
Trying to be more practical than theoretical in focus, the Handbook includes a strong 
section on data analysis of intensive repeated-measures data, as well as several other how-to
chapters (e.g., on designing a daily life study and on data cleaning). The Handbook also 
boasts several unique, applied chapters currently not found in any published resource 
(e.g., chapters on psychometrics, power analysis, data cleaning, missing data analysis,
and measurement reactivity). 


Choosing a Name 


Of course, not everything around the Handbook was smooth sailing. Unexpectedly, 
one of the hardest issues was choosing its name. Currently, no widely agreed-upon term 
comprehensively captures the range of methods in this handbook. Originally the project 
was conceived of as the Handbook of Experience Sampling Methods, with experience 
sampling referring in the broadest sense to any intensively repeated measure obtained in 
daily life. This term was initially thought to be a good title choice because most research 
psychologists would immediately be familiar with it. However, by emphasizing experi- 
ence it failed to accommodate non-self-reported forms of assessment (e.g., observational 
assessments, ambulatory cortisol or blood pressure monitoring). Moreover, experience 
sampling still conveys a specific historical root and a particular method involving the ran- 
dom sampling of moments throughout the day via a signaling device (Hektner, Schmidt, 
& Csikszentmihalyi, 2007). Other superordinate terms have been suggested and used, for 
example, diary methods (Bolger, Davis, & Rafaeli, 2003). However, here, too, the term 
appeared overly restrictive. Semantically, it implies the use of a diary and thereby again 
leaves out non-self-reported data. Yet another term, and a very widely used one, is eco- 
logical momentary assessment (EMA). EMA originated in behavioral medicine and was 
coined specifically to encompass both psychological and physiological assessments (Stone 
& Shiffman, 1994). Although it appears conceptually sufficiently broad, it is highly iden- 
tified with the medical field and emphasizes one historical root over at least two others. 
Finally, the term is not technically adequate to cover important daily life methods that 
collect data not “in the moment” but effectively in close to real time (e.g., end-of-day 
recall using daily diaries). Still another superordinate term is ambulatory assessment
(Fahrenberg, Myrtek, Pawlik, & Perrez, 2007). The term was originally coined by Jochen 
Fahrenberg and Michael Myrtek (1996) and is widely recognized in Europe. Initially, it 
referred to the use of device-assisted in situ measurement of physiological outcomes but 
was later expanded to include behavioral and psychological outcomes (www.ambula- 
tory-assessment.org). Nonetheless, this term is still relatively unfamiliar to a broad group 
of daily life researchers, mostly in North America. Also, not all methods are ambulatory 
per se. A once-a-day diary in which people complete their reports at home over the Inter- 
net is, technically, stationary. A similar argument holds for some of the latest technologi- 
cal developments in ubiquitous computing (e.g., MIT’s PlaceLab and Human Speechome 
projects). 

Therefore, the challenge was to choose a title (1) that research psychologists could 
readily identify with and (2) that accurately captures the range of methods represented 
in the Handbook. At the very core, these methods involve intensive repeated measure- 
ments in naturalistic settings (IRM-NS)—a term used by Moskowitz, Russell, Sadikaj, 
and Sutton (2009). Admittedly, we opted against IRM-NS in the title for the arguably 
unscientific reason that it is something of a tongue-twister. In light of these complexities 
(which kept us pondering for weeks and made us smile quite a bit), we decided to give 
our handbook the sufficiently broad name Handbook of Research Methods for Studying
Daily Life. Importantly, though, we made no stipulations whatsoever with respect to
what terms contributors should use to describe these methods in their chapters. (We only 
asked them to define how they would be using their preferred term.) In taking this laissez-faire
approach, we may have missed an opportunity to propose the definitive name on
which the field could (but almost surely would not) ultimately settle. Although the result
is not as confusing as it might seem, we acknowledge that a consensus name would be a 
good future aim. 


The Blessing and Curse of Technology 


Although we intended this handbook to be state of the science and cover the latest devel- 
opments, it is inherent to this field of research that some information—particularly tech- 
nological developments—changes greatly and rapidly. In fact, one of the chapters in this 
volume that is probably most useful to researchers, which reviews current hardware and 
software for self-report-based data collection (see Kubiak & Krog, Chapter 7), is at the 
same time most susceptible to rapid knowledge turnover. 

No crystal ball is needed to foresee that the future of daily life research will hold (1) 
more infusion of technology into the methods of data collection and (2) more diffusion 
of data collection into people’s natural use of their own (mobile) technology (Lazer et al., 
2009). Beyond that, those two trends will almost certainly facilitate a third one—the col- 
lection of data in increasingly larger samples. Whereas initially daily life research was de 
facto constrained to small-sample research, the ever-decreasing cost and ever-increasing 
user-friendliness and ubiquity of mobile technology will probably lead to a new
generation of daily life studies in which thousands or tens of thousands of participants 
provide information about their everyday experiences, environments, and behaviors (see 
Intille, Chapter 15, this volume; see also Killingsworth & Gilbert, 2010). 

The field of daily life research should be proud of the degree of technological sophis- 
tication and methodological penetration it has achieved within less than half a century. 
Yet it is exactly because of this immense progress that our handbook, in the end, falls 
(somewhat) short of comprehensiveness. Although a wide methodological terrain is cov- 
ered, a few topics ended up left out (and left to future volumes; e.g., day reconstruction 
method, sampling of cognitive variables in situ). The covered analytic methods in Part 
II and research fields in Part IV are a broad selection but not an exhaustive list. Yet we 
are confident that this handbook will prove to be a great friend to researchers who are 
interested in or already actively conducting their own research in daily life. 


Expressing Gratitude 


Finally (and returning to the parenting metaphor), we would like to express our deep grati- 
tude to everyone who directly or indirectly helped with giving birth to this handbook. We 
thank all of the authors for agreeing to be part of this endeavor and for contributing such 
rich and unique chapters. We feel humbled by the incredible lineup of the world’s lead- 
ing experts. We also thank the members of our Editorial Advisory Board—namely, Niall 
Bolger, Jochen Fahrenberg, Jean-Philippe Laurenceau, Harry T. Reis, Arthur A. Stone, 
and Howard Tennen—for providing the senior weight and backbone to the Handbook 
and for supporting it with both hands and hearts. We also thank our team at The Guilford
Press, specifically Seymour Weingarten and Carolyn Graham, for their help from
the Handbook’s conception to publication. Without Seymour’s vision, persistence, and 
persuasion, this handbook would not have happened. Carolyn’s practical guidance car- 
ried us through the publication process smoothly and efficiently. Last, we want to thank 
our academic mentors, Jamie Pennebaker and Sam Gosling (MRM) and Lisa Feldman 
Barrett and Howard Tennen (TSC), to whom we are deeply grateful for their patience 
and training. Like good parents, they were always there when we needed them, but they 
encouraged us to venture out and grow through our own experiences in the science of
daily life.


TAMLIN S. CONNER 
MATTHIAS R. MEHL 
THEORETICAL BACKGROUND


CHAPTER 1 


Why Researchers Should Think 
“Real-World” 


A Conceptual Rationale 


HARRY T. REIS 


How much time do parents spend with their children of varying ages? Are people
more likely to drink, smoke, or argue after a stressful day at work? Are women more
talkative than men? Do emotional experiences change body chemistry? Does television 
watching really dull the mind? Does physical activity promote emotional well-being? Do 
people eat differently when away from home, or when others are around? How does air 
temperature affect activity and mood? Which kinds of social contact are shy persons 
most likely to avoid? Do antidepressant medications increase the quality of everyday life? 
How is behavior affected by the physical settings in which we live, work, and play? 

Methods for studying daily life experiences have arrived, fueled by questions of this 
sort and new technologies. A recent search on Medline and PsycINFO revealed well over 
2,000 published papers using some of the more common examples of these methods. 
Daily life experience methods are familiar, albeit not yet standard, tools in several litera- 
tures (e.g., medicine and health, emotion, social and family interaction). In the National 
Institutes of Health’s Healthy People 2020 initiative, Bachrach (2010) highlighted these 
methods among the “tools that can revolutionize the behavioral and social sciences,” not- 
withstanding the fact that “researchers are still in the earliest stages of tapping into [their] 
vast potential.” Similarly, Croyle (2007) describes methods for real-time data capture 
as critical tools for improving the accuracy and usefulness of self-reports in biomedical 
research. Moreover, new technologies (as described throughout this handbook) promise 
to increase dramatically the scope and accessibility of these methods. In short, there is 
every reason to expect that daily life research methods will become more influential in 
the near future. 

There is some flexibility in what counts as a method for studying daily life, but most 
existing examples fall into one of two broad categories. The first, and more common, 
category includes self-reports of behavior, affect, and cognition, collected repeatedly over 
a number of days, either once daily (so-called daily diaries; see Gunthert & Wenze, Chapter
8, this volume) or sampled several times during the day. These include the two best-developed
daily life methods, the experience sampling method (ESM—Csikszentmihalyi,
Larson, & Prescott, 1977; Hektner, Schmidt, & Csikszentmihalyi, 2007) and ecological 
momentary assessment (EMA—Shiffman, Stone, & Hufford, 2008; Stone & Shiffman, 
1994), as well as event-contingent sampling (see Moskowitz & Sadikaj, Chapter 9, this 
volume), which is triggered by particular events (e.g., social interaction, sexual activity, 
exercise, or cigarette smoking). The second and newer category includes more technically 
sophisticated methods for capturing diverse, non-self-reported aspects of everyday expe- 
rience, such as the auditory environment (see Mehl & Robbins, Chapter 10, this volume), 
psychophysiological status (see Schlotz, Chapter 11, and F. Wilhelm, Grossman, & Müller,
Chapter 12, this volume), physical location (see Goodwin, Chapter 14, and Intille,
Chapter 15, this volume), and proximity to particular other persons. Clearly these meth- 
ods cover a diverse range of phenomena studied by behavioral and medical scientists. 
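
To make the idea of sampling "several times during the day" concrete, the following is a minimal sketch of one way a random prompt schedule might be generated. It assumes a stratified scheme (one random prompt per equal-length block of the waking day), which is only one of several possible designs and is not taken from any protocol described in this handbook. The function name, the parameters, and the 9:00 a.m. to 9:00 p.m. window are hypothetical choices made purely for illustration.

```python
import random
from datetime import datetime, timedelta


def random_signal_schedule(day_start, day_end, n_signals, seed=None):
    """Return n_signals prompt times between day_start and day_end.

    Illustrative assumption: the sampling window is split into n_signals
    equal blocks and one time is drawn uniformly at random within each
    block, so prompts are unpredictable but spread across the day.
    """
    rng = random.Random(seed)  # a seed makes the schedule reproducible
    total_seconds = (day_end - day_start).total_seconds()
    block = total_seconds / n_signals
    times = []
    for i in range(n_signals):
        # Offset of the i-th prompt: start of block i plus a random delay.
        offset = i * block + rng.uniform(0, block)
        times.append(day_start + timedelta(seconds=offset))
    return times


# Hypothetical usage: six prompts between 9:00 and 21:00 on one day.
schedule = random_signal_schedule(
    datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 21, 0), 6, seed=1
)
for t in schedule:
    print(t.strftime("%H:%M"))
```

A daily diary would correspond to a single fixed report per day, and event-contingent sampling would replace the random draw with a report triggered by the participant whenever the target event occurs.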

Daily life protocols are intended to “capture life as it is lived” (Bolger, Davis, & 
Rafaeli, 2003, p. 580)—that is, to describe behavior as it occurs within its typical, spon- 
taneous setting. By documenting the “particulars of life” (Allport, 1942), these methods 
provide extensively detailed data that can be used to examine the operation of social, 
psychological, and physiological processes within their natural contexts. A key premise 
of the daily life approach is that the contexts in which these processes unfold matter—in 
other words, that context influences behavior, and that proper understanding of behavior 
necessarily requires taking contextual factors into account. As the accessibility and popularity
of daily life methods have increased, researchers’ ability to ask and answer important
questions about behavior has grown in both range and complexity.

The rationale for daily life measures is often couched in methodological terms; for 
example, that they eliminate retrospection bias or minimize selectivity in describing 
experiences (see Schwarz, Chapter 2, this volume). To be sure, these are important advan- 
tages, especially in those topical areas that must rely on self-reports (e.g., when the indi- 
vidual’s personal experience is the focus of research). Nevertheless, as I argue later in this 
chapter, the conceptual advantages of daily life protocols provide an equally, if not more, 
compelling justification for their implementation. Daily life methods allow researchers 
to describe behavior as it occurs in natural contexts—a fundamental difference from 
investigations based on global self-reports or on behavior in the laboratory (Reis, 1994), 
perspectives that presently predominate in the behavioral science literature. Thus, daily 
life methods make available a different kind of information than traditional methods do, 
information that provides a novel and increasingly valuable perspective on behavior. The 
conceptual benefits of daily life methods are as important a reason for their growth as 
their methodological benefits. 

My goal in this chapter is to present the conceptual case for why researchers should 
consider adding daily life methods to their methodological toolbox. I begin by discussing 
the kind of information that daily life methods provide, highlighting ways in which they 
complement more traditional methods. Following this, the chapter reviews in turn three 
conceptual bases for daily life research: ecological validity, the value of field research, and 
the need to take context seriously. Next, I describe the role of daily life data in description 
and taxonomies, a step of theory building that in my opinion has been underemphasized 
in the behavioral sciences. The chapter concludes with a review of the place of daily life 
methods in research programs. An overarching goal of this chapter is to provide a context 
for the remainder of this handbook. My hope is that greater appreciation of why these 
methods are valuable for substantive research will make the what and how of subsequent
chapters more compelling. 


What Kind of Information Do Daily Life Methods Provide? 


Let me begin by being clear about one thing: Self-reports are here to stay. There is infor- 
mation that no one but the respondent knows (Baldwin, 2000), including motives, goals, 
emotions, and thoughts, all of which are important and influential phenomena in their 
own right, which is why many theories about human behavior and interventions focus on 
them. Nevertheless, researchers and practitioners are often skeptical about self-reports; 
for example, Stone and colleagues comment, “It is naive to accept all self-reports as 
veridical” (2000, p. ix). Over the years, many methods have been developed to try to 
improve the accuracy of self-reports, most of which at best have had limited success. 
Daily life measures are still self-reports, of course, but as discussed by Schwarz (Chapter 
2, this volume), they often represent a substantial improvement over more common ret- 
rospective methods. 

Researchers have little disagreement that retrospective responses to survey ques- 
tions, even when those surveys are well designed and carefully executed, can be biased 
(Schwarz, Chapter 2, this volume; see also Schwarz, 2007; Tourangeau, 2000). Wentland 
and Smith (1993) meta-analyzed a series of studies that included objective criteria against 
which the accuracy of self-reports could be evaluated. Across diverse topics and ques- 
tions, accuracy ranged from 23 to 100%. It seems patently obvious that survey responses 
would be affected by the limits of human memory (Tourangeau, 2000); for example, few 
survey respondents likely can remember what they were doing on Tuesday, June 20, 1995, 
how frequently they bought lunch in their high school cafeteria, or how they felt after a 
trip to the dentist 5 years ago. Of course, accuracy issues of this sort pertain only to the 
kinds of variables and processes that people are able to self-report in the first place—that 
is, personal experiences and events. Daily life measures are also used to study variables 
about which people are unlikely to have access even when they occur (e.g., psychophysi- 
ological states), or to which people are unlikely to attend unless directed by researchers 
(e.g., ambient attributes of the physical environment). For these, retrospective surveys are 
not feasible. 

It would be simple-minded to assume that because of the fallibility of memory, 
retrospective surveys are “wrong” and indices constructed from daily life accounts are 
“right.” Rather, when properly investigated, each should be considered a valid indicator 
of experience viewed from a given perspective (Reis, 1994). Retrospective surveys con- 
cern reconstructed experience; they characterize circumstances from the person’s current 
vantage point, reflecting the various cognitive and motivational processes that influence 
encoding, storage, retrieval, and assessment of episodic memories (Tulving, 1984; Visser, 
Krosnick, & Lavrakas, 2000). Daily life measures, in contrast, tap ongoing experience, 
or contemporaneous accounts of activity (often obtained in or close to real time) and the 
person’s feelings about that activity. Both types of data are relevant to understanding 
human behavior. Researchers often want to know what actually happened, but some- 
times they also want to know how people experience or understand events in their lives, 
given time to reflect on them—what McClelland, Koestner, and Weinberger (1989) called 
“systematic experience-based self-observation” (p. 700). In fact, because the transformational
processes by which actual experiences are reinterpreted represent some of the most
compelling phenomena in the behavioral sciences, comparisons of real-time and recol- 
lected responses can be particularly informative (Reis, 1994). This, of course, requires 
both kinds of measures. 

Consider, for example, a program in which researchers are interested in identifying 
emotional consequences of social isolation among older adults. A survey researcher might 
ask participants, “How much social contact have you had within the past 2 weeks?”, 
using anchors ranging from None at all to A great deal. Daily life researchers would likely 
argue, with good reason (as explained below and in later chapters of this handbook), that 
answers to this question are unlikely to correspond more than modestly (at best) with 
either subjective daily life indicators, such as reports from random or event-contingent 
diaries, or objective daily life indicators, such as might be obtained from video or audio 
records or from sensors placed in the home, or worn by participants on their apparel. On 
the other hand, there is good reason to believe that answers to a longer-term retrospective 
question (e.g., across 2 weeks) will reflect the older adult’s perceived experience of inad- 
equate social contact, a key component of dysfunctional loneliness (Cacioppo & Patrick, 
2008). Neither measure is inherently better than the other. By combining both kinds of 
measures within a study, researchers might identify circumstances in which objective 
social contact is relatively frequent yet unfulfilling, as well as circumstances in which 
social contact is relatively sparse yet the individual nevertheless feels sufficiently con- 
nected to others. This sort of integration is likely to answer important questions about 
how life events are experienced. 
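
As a rough illustration of how the two kinds of measures might be compared within one study, the sketch below averages each participant's momentary reports and correlates those averages with the same participants' retrospective ratings. This is a minimal example under stated assumptions, not an analysis drawn from the studies cited here: the data layout (one list of momentary scores and one retrospective rating per person) and the function names are hypothetical, and a real analysis would typically use multilevel models rather than a single Pearson correlation.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def momentary_vs_retrospective(momentary, retrospective):
    """Correlate person-level means of momentary reports with recall ratings.

    momentary:      dict mapping participant id -> list of momentary scores
    retrospective:  dict mapping participant id -> single retrospective rating
    (Both structures are assumptions made for this illustration.)
    """
    people = sorted(momentary)
    daily_means = [mean(momentary[p]) for p in people]
    recalled = [retrospective[p] for p in people]
    return pearson_r(daily_means, recalled)
```

A modest correlation from such a comparison would be in line with the argument above that the two measures capture different perspectives on experience, and participants whose two scores diverge would be the cases of most interest.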

In short, it is apparent that the methodological advantages of daily life methods con- 
tribute to their growing popularity (reflected throughout this handbook). The justifiable 
basis for such enthusiasm notwithstanding, researchers should remain mindful of the fact 
that momentary reports are still self-reports, and therefore are subject to construal by the 
respondent. Real-time, momentary reports of experience cannot substitute for retrospec- 
tive accounts if the individual’s reflections on his or her experience are the subject matter 
of investigation. For this reason, then, daily life measures should be considered a comple- 
ment to retrospective measures, rather than a substitute for them. This logic suggests that 
the conceptual rationale for daily life measures may matter more than the methodologi- 
cal rationale. The remainder of this chapter describes this rationale. 


Ecological Validity 


The term ecological validity refers to whether a study accurately represents the typical
conditions under which the effect under investigation occurs in the real world. This definition derives from
Brunswik’s (1956) principle of representative design, in which he argued that experiments 
must use representative samples of subjects and conditions in order to be generalizable.¹
Brewer (2000) characterizes ecological validity (which she calls “representativeness”) as 
one of three primary criteria for external validity, or “whether an effect (and its underly- 
ing processes) that has been demonstrated in one research setting would be obtained in 
other settings, with different research participants and different research procedures” 
(p. 10). Brewer’s two other criteria for external validity are robustness, or whether find- 
ings are replicated in different settings with different samples, or in different historical 
or cultural circumstances; and relevance, or whether the findings can be used to change 
behavior in the real world. 

Researchers have long debated the relative priority of internal and external validity.
This debate has emphasized the ecological validity component of external validity,
inasmuch as replication and translation into practice are seldom considered controver- 
sial. On one side of this debate, researchers may lament the low priority often ascribed 
to representativeness (e.g., Helmreich, 1975; Henry, 2008; McGuire, 1967; Ring, 1967; 
Silverman, 1971). On the other side, researchers argue that because laboratory research is 
conducted to evaluate theories under carefully controlled conditions, questions about the 
applicability of those studies to real-world circumstances are more or less irrelevant—in 
other words, experiments are done to determine “what can happen” as opposed to “what 
does happen” (e.g., Berkowitz & Donnerstein, 1982; Mook, 1983; Wilson, Aronson, & 
Carlsmith, 2010). In the biological and physical sciences, researchers deliberately create 
unrepresentative conditions in order to examine the operation of particular mechanisms 
under controlled (but theoretically informative) conditions (e.g., observing the behavior 
of electrons in a vacuum). It is reasonable to assume that controlled conditions could be 
similarly informative for behavioral theories (Petty & Cacioppo, 1996). 

For this and other reasons, students in the experimental behavioral sciences are usu- 
ally taught that internal validity has higher priority than external validity—that it is more 
important to be certain that an independent variable is the true source of changes in a 
dependent variable than to know that research findings can be generalized to other samples
and settings. For example, in one of the most influential methods volumes of the 20th
century, Campbell and Stanley described internal validity as the sine qua non of valid 
inference, while commenting that the question of external validity is “never completely 
answerable” (1966, p. 5). I do not disagree with this rank ordering of internal and exter- 
nal validity. Too often, however, the lesser priority of external validity is taken to mean 
low (or even no) priority, or, in other words, that external validity is of little concern. This 
can hardly be correct. If a process or phenomenon does not occur in the real world, how 
important can it be? And, perhaps more pointedly, if real-world conditions modify the 
operation of a process or phenomenon, would it not be important for the relevant theories 
to consider and incorporate those moderator variables? (See Cook & Groom, 2004, for 
a related discussion.) 

Daily life protocols begin with the premise that ecological validity matters, in the 
sense that by studying behavior within its natural, spontaneous context (hence the name 
ecological momentary assessment; Stone & Shiffman, 1994), generalizability of settings 
and conditions is inherently less of an issue here than in laboratory research. To be sure, 
this will not always be the case. Studies conducted in very unusual settings (e.g., the 
National Science Foundation research station in Antarctica) might have little generaliz- 
ability to other settings. Studies using invasive technology (e.g., placing prominent video 
cameras throughout the home, or having participants wear cumbersome physiological 
monitors) might alter settings sufficiently to nullify their representativeness. Ecological 
validity, in other words, is not guaranteed by the use of daily life methods; rather, it reflects
the correspondence between the conditions of a study and the conclusions that are drawn 
from it. 

By observing phenomena in their natural contexts, without controlling other influ- 
ences, behavioral processes can be investigated within the full complement of circum- 
stances in which they are most likely to occur. Consider, for example, the possibility that 
alcohol consumption often takes place in the presence of others who are also drinking. A 
laboratory study, depending on its design, might not differentiate effects of drinking in 
social and solitary settings; a study using daily life methods would do so (e.g., Mohr et 
al., 2001), thereby providing information about alcohol consumption that better reflects
the way in which people actually drink. As discussed below, the laboratory context some- 
times creates conditions that are rare in normal experience. 

There are several other reasons why daily life protocols may have greater ecologi- 
cal validity than other protocols. For one, daily life studies can examine the nature and 
repercussions of events that cannot ethically or pragmatically be studied in the labora- 
tory, such as health crises or abusive behavior in families. Of course, these events can 
be studied retrospectively, but such findings may be distorted by methodological biases, 
such as those reviewed by Schwarz (Chapter 2, this volume), as well as by suggestibility 
and lay theories about these events (e.g., Loftus, 2000; Ross, 1989). Another reason is 
that daily life methods are well suited to tracking how behavioral processes unfold over 
time; for example, how people adapt to divorce or chronic illness (Bolger et al., 2003). 
As mentioned earlier, retrospective accounts of change over time may be influenced by 
lay theories of change. Daily life measures, in contrast, assess change in real time, and 
are also sensitive to contextual factors that covary with adaptation to such events (e.g., 
divorce and chronic illness are often accompanied by changes in financial status and pat- 
terns of family interaction). A third and final reason is that real-time daily life measures 
typically assess respondents while they are physically located in the focal behavioral setting.
Retrospective reports, in contrast, are usually obtained in different locales. Properties
of the physical environment (including others present) can influence self-reports
and behavior. 

Of course, ecological validity in daily life studies does not come without a cost, 
and that cost is typically less internal validity. This is most clearly the case in correla- 
tional (nonexperimental) designs, in which the target variables are tracked or recorded 
for some period of time, then correlated in theoretically relevant ways. The vast majority 
of published daily life studies rely on correlational designs, although there are also many 
true experiments (i.e., studies in which participants are randomly assigned to different 
conditions) and quasi-experiments (i.e., designs that include controls for certain potential 
artifacts of correlational approaches) (Campbell & Stanley, 1966). In these cases, internal 
validity fares better, although there still may be significant loss due to the inability to 
standardize the participants’ environment. 

Whatever one’s position on these issues, debates about the relative importance of 
internal and external validity obscure a more fundamental point. No single study can 
minimize all threats to internal validity while simultaneously maximizing generaliz- 
ability. Internal validity requires careful control of context, whereas external validity 
requires letting contexts vary freely. Because all methods have advantages and draw- 
backs, the validity of a research program is most effectively established by methodologi- 
cal pluralism—using diverse paradigms, operations, and measures to triangulate on the 
same concepts (Campbell, 1957; Campbell & Fiske, 1959). Laboratory settings are suit- 
able for carefully controlled studies, because manipulations can be crafted there to test 
specific theoretical principles while controlling real-world “noise” and ruling out alter- 
native explanations and potential artifacts (e.g., those factors that covary in natural set- 
tings with the key independent variable). Daily life studies complement laboratory studies 
by illustrating processes in more realistic, complex settings, thereby demonstrating the 
nature and degree of their impact. 

The significance of this double-barreled approach goes beyond showing that pro- 
cesses established in laboratory research are also evident in the real world (a goal that 
most scientists would find unambitious). Brewer expressed this idea succinctly: “The kind
of systematic, programmatic research that accompanies the search for external valid- 
ity inevitably contributes to the refinement and elaboration of theory as well” (2000, 
p. 13). In other words, validity, in the broadest sense of that term, depends on matching 
protocols, designs, and methods to questions, so that across a diverse program of stud- 
ies, plausible alternative explanations are ruled out, important boundary conditions are 
determined, and the real-world relevance of a theory is established. Thus, the proper role 
of daily life research is not so much to provide findings that stand on their own as it is to 
contribute to methodologically diverse research programs that advance the depth, accu- 
racy, and usefulness of science-based knowledge and interventions. 


The Value of Field Research 


Kurt Lewin, the father of modern social psychology, is widely known for his appreciation 
of social action field research. Lewin felt that field experiments would help researchers 
keep in touch with the real-world implications of their theories, countering a “peculiar 
ambivalence [of] ‘scientific psychology’ that was interested in theory ... increasingly to 
stay away from a too close relation to life” (1951, p. 169). In the half-century that fol- 
lowed, social psychology and related fields blossomed, mostly on the back of laboratory 
experimentation. No doubt researchers gravitated to the laboratory because of its many 
benefits, including experimental control over variables, settings, and procedures, which 
allowed researchers to control extraneous influences and thereby maximize internal 
validity, as well as the convenience of undergraduate samples. Field experiments did not 
disappear, but they were at best an occasional presence in leading journals. 

The advantages of laboratory experimentation have a price, however, in terms of 
increasing distance from Lewin’s “close relation to life.” Laboratory settings by defini- 
tion isolate research participants from their everyday concerns and activities, and subject 
them to an artificial environment in which nearly all contextual factors—for example, 
physical features, goals, other persons involved, and even the possibility of getting up and 
doing something else—are determined by the experimenter. In field settings, in contrast, 
the physical and social environment is substantially more cluttered: People must continu- 
ally contend with multiple stimuli that compete for attention; they must choose for them- 
selves which tasks to pursue and how to engage them; and the option of changing settings 
or tasks is usually available. All of these can, of course, alter the results of research. 

Weick (1985) makes a compelling case for the value of considering Lewin’s “close 
relation to life” in interpreting the findings of research. Which of the following situations, 
he asked, gets “closer” to the human condition: a study of how subjects in a laboratory 
experiment tell a new acquaintance that she is about to receive a mildly painful electric 
shock, or a study of how a coroner announces death to next of kin; anticipating a mild 
electric shock in a controlled laboratory setting or learning how to work on high steel in a
21-story building; or, predicting the sequence in which light bulbs will light up or betting 
a week’s salary on the spin of a roulette wheel? Weick argued that “distance from life” 
encourages ambiguity and subjectivity in behavior, and thereby reduces the informative- 
ness of research. 

Field settings do not guarantee “closeness to life,” of course. Field settings can be 
trivial and uninvolving, just as laboratory settings can be consequential and intensely 
engaging. (This is reminiscent of the distinction between mundane realism, or the extent
to which the events of an experiment resemble real-world events, and experimental realism,
or the extent to which an experimental scenario is involving; Wilson et al., 2010.)
Increasingly, however, laboratory studies command relatively little engagement from par- 
ticipants (Baumeister, Vohs, & Funder, 2007), a trend that seems likely to continue given 
progressively more stringent ethical limitations hampering researchers’ ability to create 
scenarios that maximize attention and motivation. In contrast, carefully selected field set- 
tings can maximize engagement with little or no intervention by researchers. Compare, 
for example, the results of laboratory studies in which undergraduates rate pictures of 
hypothetical dates with studies based on actual interactions in a speed-dating context 
(Finkel & Eastwick, 2008). By being “closer to life,” then, field studies can make the 
research setting absorbing and personally meaningful, thereby better illuminating human 
motives, defenses, affects, and thought processes. 

It bears noting that the rationale for studying daily life experience does not assume 
that the events or time periods under scrutiny are intense or profound. Just the opposite 
is true, in fact: Everyday life activities are often so mundane and uncompelling that they 
slip under the radar of conscious awareness. (For example, how many times did you nod 
or say hello to an acquaintance yesterday?) To capture them, methods based on random 
sampling of moments are needed, such as ESM or EMA, because methods based on recol- 
lection and selection would likely lead participants to overlook the occurrence or details 
of very ordinary experiences. By focusing on random samples of the “little experiences of 
everyday life that fill most of our working time and occupy the vast majority of our con- 
scious attention” (Wheeler & Reis, 1991, p. 340), daily life methods bring research “closer 
to life,” not because the participant’s attention has been galvanized but because natural 
activity has been observed. Representativeness is thus a key part of the rationale for field 
research. Theories of human behavior based solely on deeply meaningful, highly absorb- 
ing activities and major life events would surely neglect much of human experience. 

Field research, especially field experimentation, is often equated with replication or 
application; that is, some researchers conduct field experiments to determine whether a 
phenomenon or process established in the laboratory also occurs in natural settings or, 
alternatively, can be applied to yield personal or social benefit. Although these purposes 
are surely valuable, they disregard the potential role of field research in theory develop- 
ment. Field settings are ideal for identifying an effect’s boundary conditions and mod- 
erators. For example, the impact of a given variable may be enhanced or offset by other 
variables present in natural contexts but not in the controlled confines of the laboratory 
(Mortenson & Cialdini, 2010). Similarly, processes or phenomena may be influential 
among certain classes of individuals but not others. Perhaps ironically, then, the abil- 
ity to control extraneous influences that gives laboratory experimentation much of its 
enviable power and precision may mask circumstances that affect the operation of basic 
behavioral processes within their likely natural conditions (Reis, 1983). On this basis, 
Mortenson and Cialdini (2010) advocated a “full cycle” approach to theory development: 
using laboratory experiments to refine theories, and using field studies to establish the 
nature and impact of these theories when experimental control is relinquished and natu- 
ral circumstances are allowed to prevail. 

Consider the following example. Laboratory experiments have established that 
exposure to violent media increases the tendency toward aggressive behavior in uncon- 
strained social interactions (see Wood, Wong, & Chachere, 1991, for a review). Simple 
laboratory experiments comparing exposure and no-exposure control groups are unrivaled
in their ability to control extraneous sources of variance and to support a causal
explanation for this effect. What these experiments do not indicate is whether this effect 
occurs when attention is divided (e.g., by text messaging, homework, or the presence of 
others), a natural circumstance of everyday media exposure. Do other experiences in the 
person’s life, such as school, friendship, or family interaction, play a moderating role? 
Does the impact of media exposure vary when approving peers or disapproving parents 
are present? Are preexisting or chronic affective states influential? Do different types of 
violent media have differential effects? Do men and women, or aggression-prone and 
non-aggression-prone people, respond more or less strongly to media violence? Do selec- 
tion biases determine who chooses to watch violent media? Daily life studies can address 
such questions and, on the basis of the evidence they provide, researchers might conduct 
further experimentation to consider causal mechanisms. In this way, laboratory experi- 
mentation and daily life studies conducted in the field can play complementary roles in 
advancing theories. 

Field research can also play another, more innovative role in theory development, 
namely, to “scout out” new effects (Mortenson & Cialdini, 2010), that is, to suggest new 
processes and hypotheses worthy of further investigation. Daily life data are particularly 
well suited to “discovery-oriented” research (as contrasted with hypothesis testing). Daily 
life datasets tend to be large and rich in detail and description, affording ample opportu- 
nities for creative exploration and data mining—sorting through large amounts of data 
to identify complex, not readily apparent patterns of association. With suitably large 
datasets and increasingly sophisticated statistical procedures, it is possible to uncover 
important regularities that lead to theoretical or applied advances. Once identified, more 
traditional approaches can be used to verify and elaborate these discoveries. 

A commonly cited advantage of field studies is that research participants may be 
unaware of being observed, thereby minimizing reactivity effects (Kimmel, 2004; Reis & 
Gosling, 2010). Unfortunately, this tends not to be the case in daily life studies, inasmuch 
as such studies require participants either to record information about current events 
(e.g., ESM, EMA) or to carry with them ambulatory recording devices. One way in which 
daily life researchers can minimize such effects is to emphasize the cooperative, nondeceptive
intent of daily life research. Furthermore, when researchers provide a brief adaptation
period at the beginning of a study, participants often become accustomed to the protocol,
which reduces reactivity. Reactivity effects are discussed in more detail by Barta, Tennen, and
Litt (Chapter 6, this volume). 

Finally, field research commonly provides access to larger, more diverse samples than 
does laboratory research. One reason why experimentation with college students became 
popular is the logistical difficulty of recruiting nonstudent samples to participate in labo- 
ratory studies (Sears, 1986). Time, availability, convenient access, and cost all favor the 
use of college students as research participants. With daily life studies, researchers have 
less incentive to prefer student samples over more diverse samples. 

It is important to remember that the setting in which a study is conducted is inde- 
pendent of whether that study is experimental or nonexperimental. As mentioned ear- 
lier, daily life studies tend to use correlational designs, whereas Lewin-inspired social 
action research tends to be experimental or quasi-experimental. Nevertheless, daily life 
measures are readily adapted to experimental designs. For example, daily life measures 
can serve as outcomes in field experiments, such as to quantify everyday experience for 
participants randomly assigned to an intervention condition or a control group. New
technologies developed for daily life studies can also deliver experimental interventions. 
For example, Heron and Smyth (2010) review the results of ecological momentary inter- 
ventions (EMIs)—interventions used to treat problems, such as smoking, anxiety, or eat- 
ing disorders, that are delivered in ambulatory, time-relevant contexts by using palmtop 
computers or other mobile devices (Chapters 9-17 of this handbook discuss these tech- 
nologies and their application). In summary, recent advances in ambulatory technology 
provide increasingly flexible tools for conducting experiments in field settings, allowing 
researchers to avail themselves of the advantages of experimentation and field research 
simultaneously. 


Taking Context Seriously 


The impact of context on behavior is fundamental. Ever since the pioneering research of 
Roger Barker (1968; Barker & Wright, 1954), most behavioral scientists have acknowl- 
edged that context affects behavior. Barker believed that to understand behavior, one 
had to first understand what sorts of behavior the setting—its context—was likely to 
evoke. Thus, he called on researchers to identify regularities in the properties of behavior 
settings (e.g., homes, classrooms, medical offices, or roadways) and the behavioral pat- 
terns that they evoked. Barker’s proposition, widely accepted throughout the behavioral 
sciences, is particularly evident in two subdisciplines: environmental psychology, which 
studies the influence of the built and natural environment on behavior (Proshansky, Ittel- 
son, & Rivlin, 1976; Stokols & Altman, 1987), and social psychology, which studies how 
the psychological properties of situations influence behavior (Ross & Nisbett, 1991). 

Daily life research takes context into account in one of three ways. First, some stud- 
ies seek to control context effects by assessing behavior in its natural (presumably, rep- 
resentative) context rather than in specialized environments. For example, blood pres- 
sure can be elevated when it is assessed in a doctor’s office—the so-called “white coat 
syndrome”—suggesting the value of collecting ambulatory readings before prescribing 
medications to lower blood pressure (WebMD, 2010). Second, daily life research may 
assess context and behavior simultaneously, so that associations can be identified. Csik- 
szentmihalyi and colleagues (1977) developed the ESM to examine “fluctuations in the 
stream of consciousness and the links between the external context and the contents of 
the mind” (Hektner et al., 2007, p. 6). Thus, many of their studies examine affective 
states among adolescents as a function of what they are doing. For example, flow (a 
mental state in which people are fully and energetically immersed in whatever they are 
doing) tends to be low among adolescents in many school activities and while watching 
television. Third, new technologies allow researchers to ask context-sensitive questions 
(Intille, 2007; Chapter 15, this volume). For example, accelerometers (which identify 
motion patterns) let researchers prompt participants to record their thoughts or feelings 
upon awakening or completing exercise. Similarly, questions tailored to the participant’s 
location can be administered on the basis of readings from global positioning devices 
(e.g., on a crowded city street, at home, or in nature). 
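
The third approach, context-sensitive prompting, can be illustrated with a minimal sketch. It is not drawn from any particular study or device; the `Reading` record, the one-reading-per-minute stream, the activity threshold, and the minimum bout length are all hypothetical choices made for illustration. The rule is simply to prompt the participant at the first quiet reading that follows a sustained bout of movement, which approximates "upon completing exercise."

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Reading:
    minute: int       # minutes since the start of recording (assumed one reading per minute)
    activity: float   # hypothetical movement intensity, arbitrary units


def exercise_end_prompts(
    readings: Iterable[Reading],
    active_threshold: float = 1.5,   # assumed cutoff for "exercising"
    min_active_minutes: int = 10,    # assumed minimum length of an exercise bout
) -> List[int]:
    """Return the minutes at which a participant would be prompted.

    A prompt is issued at the first low-activity reading that follows a run
    of at least `min_active_minutes` consecutive high-activity readings.
    """
    prompts: List[int] = []
    run_length = 0
    for r in readings:
        if r.activity >= active_threshold:
            run_length += 1
        else:
            if run_length >= min_active_minutes:
                prompts.append(r.minute)
            run_length = 0
    return prompts


if __name__ == "__main__":
    # Simulated record: 5 quiet minutes, 12 active minutes, then rest.
    stream = (
        [Reading(m, 0.2) for m in range(5)]
        + [Reading(m, 2.0) for m in range(5, 17)]
        + [Reading(m, 0.3) for m in range(17, 20)]
    )
    print(exercise_end_prompts(stream))  # prints [17]
```

A real protocol would presumably also need to handle sensor noise, missing readings, and limits on how often a participant may be interrupted; those decisions are left open here.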

Laboratory experimentation sometimes does not consider the extent to which the 
laboratory setting itself may contribute to the outcomes of research. This seems ironic; if 
settings had no influence on behavior, why would they need to be controlled? Every laboratory
has unique physical features, but beyond this, the laboratory setting itself may engender
certain expectations and scripts (e.g., scientific legitimacy, serious purpose, suspicion
about possible deception, concerns about being observed, the need for attentiveness), all 
of which may affect the participant’s thoughts and behavior (Shulman & Berman, 1975). 
One example is demand characteristics (cues that suggest to research participants
the behaviors that researchers expect of them), a well-known source of bias in research
(Wilson et al., 2010). To be sure, as described earlier, research findings obtained outside 
the laboratory are often influenced by context. However, those contexts tend to be char- 
acteristic of the participant’s life and experience, which, far from being a confound to be 
eradicated, contribute to the ecological validity of daily life studies. Moreover, natural 
contexts tend to offer more distractions and alternatives (e.g., participants have some 
choice over what they do, when, where, and with whom), affording self-direction and 
spontaneous selection. In field research, the setting thus becomes fundamental to theoretical
accounts of behavior (Weick, 1985). In a laboratory cubicle, by contrast, participants can do
little else but complete the tasks assigned to them by researchers as quickly as possible.

Contexts differ along many dimensions, some of which seem likely to have minimal 
impact on research. For example, administering a standardized survey in a classroom 
versus a laboratory cubicle may make little difference, whereas conducting a field experi- 
ment on the impact of affectionate smiles on attraction at a singles bar versus a labora- 
tory room may matter more. Snyder and Ickes (1985) differentiated situations in terms of 
the strength of their cues about behavior. So-called strong situations are relatively struc- 
tured, providing salient, unambiguous, and compelling cues about appropriate behavior. 
Weak situations, in contrast, are unstructured, offer few or no incentives, and have few 
or ambiguous cues to guide behavior. Snyder and Ickes proposed that strong situations
are likely to support normative theories—that is, most people behaving the same way— 
whereas weak situations are more likely to reveal individual differences (a sensible pro- 
posal that has yet to be tested empirically; Cooper & Withey, 2009). Either is amenable 
to daily life research. 

More generally, the social-psychological study of situations provides a framework 
for conceptualizing the impact of context on behavior (see Reis & Holmes, in press, for 
a review). Three dimensions have received the most attention: 


• Nominal properties of the setting. As mentioned earlier, environmental psychologists
commonly study the physical properties of behavior settings, such as environmental
stress (e.g., noise, crowding), space utilization, the impact of architectural or natural 
design, and ambient conditions (e.g., temperature, odor). Social-psychological research 
has extensively examined the role of situational contextual cues. For example, violent 
cues in a laboratory room (e.g., a poster depicting a gun) can increase aggressive behavior 
(Berkowitz, 1982), whereas the color red increases men’s attraction to women (Elliot & 
Maier, 2009). Often, this form of influence occurs automatically (i.e., without conscious 
attention or deliberate intent) or outside of awareness (Dijksterhuis & Bargh, 2001). 


• Goals activated by the setting. The meaning people ascribe to situations often
depends on “what happened, is happening, or might happen” (Yang, Read, & Miller, 
2009, p. 1019) with regard to their goals. Thus, to goal theorists, contexts influence 
behavior by activating certain goals, which then influence thought, affect, and behavior 
(Carver & Scheier, 1981). Situations activate goals both normatively and idiographically. 
For example, achievement settings commonly activate performance and mastery goals,
whereas social settings activate goals for acceptance and affiliation, but the specific form 
of these goals may vary from person to person (e.g., to achieve success or closeness vs. 
avoid failure or rejection) (Elliot & Thrash, 2002; Mischel & Shoda, 1999). Reis and
Holmes (in press) suggest that the goal relevance of situations be conceptualized in terms 
of affordances: that situations do not dictate behavior, but rather provide opportunities 
for the expression of a person’s wishes, desires, needs, and fears. 


• Other persons present or thought about in the setting. Extensive research documents
the impact of the interpersonal context of behavior—who else is present, one’s
history with that person and similar others in related situations, and what one is try- 
ing to accomplish with that person (Reis, Collins, & Berscheid, 2000). In other words, 
people do not respond to the same stimuli in the same way irrespective of others who 
are involved or affected, but they vary their behavior as a function of interpersonal cir- 
cumstances. Sometimes this occurs because other persons become the focal aspect of the 
situation—for example, a romantic dinner date typically emphasizes the dating partner 
more than the meal. In other instances, the setting varies because of changes in pat- 
terns of interaction. One influential theory, interdependence theory (Kelley et al., 2003), 
explains this influence in terms of outcome interdependence: the nature and extent to 
which two or more persons depend on and influence one another with respect to their 
potential outcomes from an interaction. 


Contextual factors can also be macroenvironmental, as recently highlighted by Oishi 
and Graham (2010). They argue that socioecological characteristics—“physical, societal, 
and interpersonal environments (e.g., climate, democracy, social networks) [that] affect 
the emotions, cognitions, and actions of groups and individuals” (p. 356)—have failed to 
receive sustained or systematic attention in psychological science. Although these factors 
can be difficult, if not impossible, to isolate or manipulate in the laboratory, they are well 
suited to investigation with daily life methods. 

In conclusion, daily life studies approach research with a clear appreciation for the 
importance of context. By studying behavior in natural, appropriate contexts, research- 
ers sacrifice control over settings in order to understand better how contexts influence 
behavior. Of course, contextual features can also be studied in laboratory experiments— 
most notably, by experimental manipulations of contextual variables. As valuable and 
necessary as such studies are, laboratory settings inevitably differ in subtle and perhaps 
not-so-subtle ways from the real-world circumstances they are intended to recreate. Thus, 
programs of research maximize their validity and usefulness by incorporating both kinds 
of studies. 


Daily Life Methods as a Tool for Description and Taxonomies


Daily life methods have long appealed to researchers with an interest in description. For 
example, daily life studies have documented how people spend their time (Robinson & 
Godbey, 1997; Gunthert & Wenze, Chapter 8, this volume), how they socialize (Reis & 
Wheeler, 1991), what they eat (Glanz & Murphy, 2007), when they drink and smoke 
(Collins & Muraven, 2007; Shiffman, 1993), and how they feel during various activities
(Hektner et al., 2007). This is because daily life data provide detailed and relatively
unbiased records of real-time, real-world experience. Representativeness is essential for 
descriptive research; otherwise, that which is being described would be skewed toward 
oversampled events or accounts. For example, descriptions of daily affect based on ret- 
rospections tend to paint a more extreme picture of emotional experience than do real- 
time diaries, presumably because muted emotional states, although more common than 
extreme affects, tend to be more easily forgotten and are therefore underrepresented in 
retrospective accounts (e.g., Thomas & Diener, 1990). 

Descriptive data matter more than is generally acknowledged. For example, Asch 
explained, “Before we inquire into origins and functional relations, it is necessary to 
know the thing we are trying to explain” (1952, p. 65). Similarly, Reis and Gable com- 
mented, “To carve nature at its joints, one must first locate those joints” (2000, p. 192). 
Nevertheless, relative to hypothesis testing, description is an underappreciated and sel- 
dom practiced step in theory development in many of the behavioral sciences (Rozin, 
2001). This is unfortunate. Perhaps this has occurred because, as Jones (1998) explains, 
empirically minded researchers often confuse descriptive research with “a loose assort- 
ment of observational techniques and ‘negotiation’ by interview” (1998, p. 48). 

Daily life research, properly conducted, should not be so confused, of course. 
Description based on sound empirical methods contributes to theory development by 
characterizing the major and distinguishing features of the entities in question, thereby 
providing input for hypotheses about them, as well as informing investigations of their 
causal characteristics and typical behavioral sequelae. For example, in the biological sci- 
ences, Darwin spent years studying and cataloging barnacles and finches, generating 
observations that eventually led him to formulate the theory of evolution (Quammen, 
2007). Budding researchers are often taught to derive their hypotheses top-down, from 
general theory to particular hypotheses. Yet bottom-up thinking can also yield useful 
insights: using descriptive databases to identify the nature of a phenomenon; the cir- 
cumstances in which it is most likely to occur; and its typical covariates, consequences, 
and limiting conditions. This sort of information is also critical for applications of basic 
research. For example, knowing that adolescents often initiate risky behaviors in a social 
context (Jessor, 1992) suggests that certain kinds of interventions are more likely to be 
effective than others. 

Inasmuch as descriptive data tend to be uncommon in the behavioral sciences, it 
may not be surprising that we have few generally accepted taxonomies for classifying 
our research subject matter into conceptually related categories. (This despite the fact 
that individuals and societies often rely on lay taxonomies for understanding key entities 
in their environment [e.g., plants and food sources; Atran, 1990].) Recognizing what a 
phenomenon is (and is not) can provide a foundation for theory development in behav- 
ioral science, just as descriptive taxonomies of species provide a foundation for biologi- 
cal theories (Kelley, 1992). It may seem to some readers that the worth of taxonomies is 
self-evident. At the most elementary level, a taxonomy helps to organize existing findings 
and theories. “A taxonomy is a system for naming and organizing things into groups that 
share similar characteristics” (Montague Institute, 2010). Much like the periodic table in 
chemistry or the Diagnostic and Statistical Manual of Mental Disorders (DSM) in psy- 
chopathology, a good taxonomy both facilitates identification of conceptual similarities 
among entities, and delineates the ways one entity differs from another. (This is similar
to establishing convergent and discriminant validity among constructs.) In the ideal case,
taxonomies identify mutually exclusive categories, are sufficiently inclusive to cover all 
instances within a set, and can be applied unambiguously (Hull, 1998). More generally, a 
good taxonomy designates which aspects of a phenomenon need to be understood, which 
constructs might be fruitful in this regard, and how seemingly diverse entities might actu- 
ally be related (Rozin, 2001). 
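
Two of the ideal-case criteria, mutual exclusivity and inclusiveness, lend themselves to a simple mechanical check once a candidate taxonomy has been applied to a set of coded observations. The sketch below is only an illustration under assumed names: the function `check_taxonomy`, the activity codes, and the category labels are hypothetical, and the third criterion (unambiguous application) is a matter of coder agreement that this check does not address.

```python
from typing import Dict, Hashable, Set


def check_taxonomy(
    categories: Dict[str, Set[Hashable]],
    universe: Set[Hashable],
) -> Dict[str, Set[Hashable]]:
    """Report items that violate exclusivity or inclusiveness.

    `categories` maps each category label to the items assigned to it;
    `universe` is the full set of items the taxonomy is meant to cover.
    """
    seen: Set[Hashable] = set()
    overlapping: Set[Hashable] = set()
    for members in categories.values():
        overlapping |= seen & members   # items already placed in an earlier category
        seen |= members
    return {
        "overlapping": overlapping,      # assigned to more than one category
        "uncovered": universe - seen,    # assigned to no category
    }


if __name__ == "__main__":
    # Hypothetical activity codes from a daily diary study.
    observed = {"meeting", "email", "tv", "reading", "commuting"}
    draft = {
        "work": {"meeting", "email"},
        "leisure": {"tv", "reading", "email"},
    }
    print(check_taxonomy(draft, observed))
```

In this made-up example, "email" falls into two categories and "commuting" into none, so the draft taxonomy is neither mutually exclusive nor fully inclusive for the observed activities.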

Researchers interested in taxonomies have adopted several strategies to acquire 
the sort of comprehensive, representative datasets that are needed. For example, some 
researchers use a lexical approach: Dictionaries of common terms are created from mul- 
tiple sources, based on the premise (first suggested by Sir Francis Galton) that impor- 
tant concepts in a culture are represented by words (e.g., Edwards & Templeton, 2005). 
This approach, although useful, typically does not take the frequency of occurrence into 
account. Daily life methods can provide ideal datasets for developing taxonomies. In one 
area, affect, ESM and daily diary data have already contributed significantly to ongoing 
debates about the best structure with which to represent emotion and mood (e.g., Russell 
& Feldman Barrett, 1999; Watson, Wiese, Vaidya, & Tellegen, 1999). Other examples 
can be imagined readily. For example, daily life data about social interaction might be 
used to create formal taxonomies of the nature and impact of relationships. Speech sam- 
ples collected with the Electronically Activated Recorder might help build taxonomies 
of everyday language use (Mehl & Robbins, Chapter 10, this volume). Ambulatory or 
telemetric monitoring (Goodwin, Chapter 14, this volume) could help develop models of 
how and where people spend their time. 

Daily life data might also help validate taxonomies developed through other means. 
For example, convergent and discriminant validity for different DSM categories might 
be established by comparing ESM or EMA data for individuals in different diagnostic 
categories. One would expect similar patterns of experience for people in closely related 
categories, but not in more conceptually disparate categories. Another example can be 
seen in research on the so-called “Big Five” personality traits, where daily life data have 
been useful in establishing behavioral evidence of this structure for personality traits 
(McCabe, Mack, & Fleeson, Chapter 18, this volume; see also John & Srivastava, 1999). 
Because daily life studies are ideally suited for studying how events or states unfold over 
time (Bolger et al., 2003), they also can help describe temporal attributes associated with 
taxonomic categories (e.g., how the behavior of different personality types or relation- 
ships evolves over time). 


Conclusion 


The existence of this handbook, and the extraordinary diversity of topics and methods 
encompassed within its pages, is a sure sign that daily life methods have established their 
niche in the ecology of behavioral science methods. Whereas not long ago the questions 
that daily life methods could address were limited by available technology, recent devel- 
opments in miniaturization, accessibility of the Internet and mobile technology, and sta- 
tistical tools that take full advantage of the data they supply suggest a promising future. 
It is easy to predict, then, that in the coming years daily life studies will be an increasing 
presence in scholarly journals. 


This chapter has argued that the value of daily life research goes well beyond the min- 
imization of cognitive biases (see Schwarz, Chapter 2, this volume) by assessing behavior 
in real time. These benefits are not inconsiderable, but they mask the more fundamental 
gains to be realized from a more contextually grounded approach to knowledge. If the 
behavioral sciences have learned anything in the century or so since they became major 
players in academic scholarship, it is that behavior is influenced by contextual factors. 
Whether the subject of one’s attention is preferences among political candidates; health 
care decisions; consumer spending; emotions induced by life events; decisions to date, 
marry, or divorce; or learning in schools, context matters. As the various chapters in this 
handbook make plain, daily life studies are among the most effective methods for assess- 
ing the impact of context. 

To be sure, daily life methods are not the only means to study the effects of context. 
Because of their various limitations (notably, the inability to hold extraneous factors con- 
stant in an experimental design), other methods will remain superior for certain research 
purposes. Rather, daily life research is most beneficial in helping to fulfill the promise 
of methodological pluralism first advocated by Campbell more than a half-century ago 
(Campbell, 1957; Campbell & Fiske, 1959). Simply stated, and as discussed earlier, valid- 
ity is better understood as a property of research programs than of individual studies 
(Brewer, 2000). Although most researchers agree in principle about the desirability of 
supporting one’s conceptualization through multiple and diverse methods, thereby ruling 
out method-bound explanations, this principle is honored more in the saying than in the 
doing. I believe that Donald Campbell, were he alive today, would be pleased to see the 
contribution of daily life methods to methodological pluralism. 

In closing, one final argument in favor of daily life methods deserves mention: 
They’re fun! 


Note 


1. It is interesting to note that Brunswik (1956) used the term ecological validity to mean something 
different from representative design. Hammond (1998) discusses in detail how Jenkins (1974) and 
Bronfenbrenner (1977), among others, redefined Brunswik’s term to its current common usage, 
engendering some conceptual confusion. 


CHAPTER 2


Why Researchers Should Think “Real-Time”


A Cognitive Rationale 


NORBERT SCHWARZ 


When we want to know what people think, feel, and do, we ask them. This reliance
on self-reports is based on the tacit assumption that people know their thoughts,
feelings, and behaviors and can report on them “with candor and accuracy,” as Angus 
Campbell (1981), a pioneer of survey research, put it. From this perspective, problems 
arise when the research situation discourages candor and accuracy, when the questions are 
ambiguous and difficult to understand, or when the task exceeds participants’ knowledge 
and the limits of memory. A large methodological literature addresses these concerns and 
what to do about them (for reviews see Bradburn, Sudman, & Wansink, 2004; Sudman, 
Bradburn, & Schwarz, 1996; Tourangeau, Rips, & Rasinski, 2000). The lessons learned 
from this work highlight that many self-report problems can be attenuated by asking 
questions in close temporal proximity to the event of interest. Doing so constrains the 
multiple meanings of questions, reduces memory and estimation problems, and facilitates 
access to episodic detail, all of which can improve self-report. The real-time or close-in- 
time measures discussed in this handbook take advantage of this insight. 

However, these (largely uncontroversial) methodological issues are only some of the 
reasons why researchers should think real-time. At a more fundamental level, recent 
research across many areas of psychological science highlights that every aspect of human 
cognition, emotion, motivation, and behavior is situated and highly context-sensitive, 
thwarting attempts to understand it in a decontextualized way (see the contributions in 
Mesquita, Feldman Barrett, & Smith, 2010). As this work progresses, it becomes increas- 
ingly clear that our methods should acknowledge this insight. They rarely do. This issue 
goes beyond the familiar methodological questions of “How to ask about X” and pre- 
sents a fundamental (and controversial) challenge to bring our empirical operations into 
line with our theoretical assumptions. Studying psychological phenomena in the context 
of daily life can make important contributions to this development by shedding new light
on the situated and embedded nature of human behavior and experience. 


Preview 


This chapter elaborates on these themes. The first section summarizes basic insights into 
how respondents answer questions and sets the stage for later sections. To date, research 
into the cognitive and communicative processes underlying self-reports has rarely 
addressed real-time (or close-in-time) measurement, which presents its own set of self- 
report problems. I draw attention to some of them and offer pertinent conjectures. The 
second section addresses reports of past behavior and reviews issues of autobiographical 
memory, highlighting the role of inference strategies and lay theories in determining what 
“must have been” (Ross, 1989). It pays particular attention to what respondents can, or 
cannot, report on with some accuracy. 

The third section turns to reports of emotions and physical symptoms. It compares 
prospective reports of expected future feelings and retrospective reports of past feelings
with concurrent reports of momentary experience. Of particular interest are systematic 
convergences and divergences between these reports. On the one hand, predicted feel- 
ings usually converge with remembered feelings and the behavioral choices people make; 
on the other hand, all of these variables are often poorly related to actual experience as 
assessed by real-time measures (Schwarz, Kahneman, & Xu, 2009). These dynamics 
illustrate that feelings are fleeting and poorly represented in memory (Robinson & Clore, 
2002); once they have dissipated, respondents need to reconstruct what their feelings may 
have been. Shortly after the experience, episodic reconstruction can result in relatively 
accurate reports, as indicated by convergence with concurrent assessments (Kahneman, 
Krueger, Schkade, Schwarz, & Stone, 2004). But as time passes, respondents resort to 
general knowledge to infer the past experience, which is also the knowledge used for pre- 
dicting future feelings; these predictions, in turn, are the basis for intention and choice 
(“Would this be good for me?”). Hence, prediction, intention, choice, and later global 
memories converge because they are based on similar inputs—and this convergence 
seems to confirm that one’s predictions and choices were right all along. Unfortunately, 
concurrent measures often suggest otherwise, but this lesson is missed with the fading 
feeling (Schwarz et al., 2009). These dynamics impair learning from daily experience and 
challenge researchers’ reliance on the consistency of respondents’ reports as an indicator 
of validity. 

The final section turns to reports of attitudes and preferences. It reviews the promises 
and pitfalls of the traditional conceptualization of attitudes as enduring dispositions and 
notes the malleable nature of attitude reports. Whereas this malleability is usually con- 
sidered deplorable measurement error, a situated cognition approach suggests that it may 
reflect something more laudable and adaptive. If evaluation stands in the service of current 
action, it is likely to benefit from sensitivity to one’s current goals and close attention to the 
affordances and constraints of one’s current context (Schwarz, 2007). From this perspec- 
tive, the context “dependency” that frustrates observers and researchers, who both want 
to predict an actor’s behavior, reflects an adaptive context “sensitivity” that may serve the 
actor well. Real-time measurement in situ can shed new light on the underlying dynamics, 
in particular when it adopts the actor’s rather than the observer’s perspective. 


Answering Questions: The Psychology of Self-Report


Answering a question in a research context requires that respondents (1) interpret the 
question to understand what the researcher wants to know and (2) retrieve and select 
relevant information to (3) form an answer. In most cases, they cannot provide an answer 
in their own words but (4) need to map it onto a set of response alternatives provided 
by the researcher. Finally, (5) respondents may wish to “edit” their answer before they 
communicate it for reasons of social desirability and self-presentation. Respondents’ 
performance at each of these steps is context-sensitive and profoundly influenced by 
characteristics of the research setting and instrument. Extensive reviews of these issues are 
available (Schwarz, Knäuper, Oyserman, & Stich, 2008; Sudman et al., 1996; Tourangeau et 
al., 2000); I summarize key points and draw attention to some implications for real-time 
measurement. 


Question Comprehension 


The key issue at the question comprehension stage is whether respondents’ understand- 
ing of the question matches the meaning the researcher had in mind. As all textbooks 
note, writing simple questions and avoiding unfamiliar or ambiguous terms helps (see 
Bradburn et al., 2004, for good practical advice). But ensuring that respondents under- 
stand the words is not enough. When asked, “What have you done today?”, respondents 
will understand the words but they still need to determine what kinds of activities inter- 
est the researcher. Should they report, for example, whether they took a shower or not? 
Providing an informative answer requires inferences about the questioner’s intention to 
determine the pragmatic meaning of the question (Clark & Schober, 1992; Schwarz, 
1996). 


QUESTION CONTEXT AND ORDER 


To infer the pragmatic meaning, respondents draw on contextual information, from the 
purpose of the study and the researcher’s affiliation to the content of adjacent questions 
and the nature of the response alternatives. Their use of this information is licensed by 
the tacit assumptions that govern the conduct of conversation in daily life (Grice, 1975), 
which respondents bring to the research situation (for reviews, see Schwarz, 1994, 1996). 
Hence, they interpret a given question in the thematic context of the overall interview, 
and a term such as drugs acquires different meanings when presented in a survey pertain- 
ing to respondents’ medical history rather than to crime in the neighborhood. Similarly, 
they attend to the researchers’ affiliation to infer the likely epistemic interest behind their 
questions. Taking this interest into account, respondents’ explanations emphasize per- 
sonality variables when asked by a personality psychologist, but social context variables 
when asked by a social scientist (Norenzayan & Schwarz, 1999). Respondents further 
assume that adjacent questions are meaningfully related to one another, unless otherwise 
indicated, and interpret their intended meaning accordingly (e.g., Strack, Schwarz, & 
Wänke, 1991). 

When the data collection method enforces a strict sequence, as is the case for per- 
sonal and telephone interviews, and for computer-administered questionnaires that do 
not allow respondents to return to earlier questions, preceding questions can influence 
the interpretation of subsequent questions, but not vice versa. In contrast, preceding as 
well as following questions can exert an influence when respondents can become aware of 
all questions prior to answering them, as is the case for paper-and-pencil questionnaires 
and computer programs without strict sequencing (Schwarz & Hippler, 1995). Most real- 
time studies probably fall into the latter category given that they repeat a small number of 
questions with high frequency, thus allowing respondents to know what is coming even 
when the instrument enforces a strict order. 

The maxims of cooperative conversational conduct further ask speakers to provide 
information the recipient needs and not to reiterate information the recipient already 
has (Grice, 1975). Respondents observe this norm and hesitate to reiterate information 
they have already provided in response to an earlier question (for a review, see Schwarz, 
1996). For example, Strack and colleagues (1991) observed a correlation of r = .95 when 
respondents were asked to report their overall happiness and their overall life satisfaction 
in two separate questionnaires, attributed to different researchers. However, the correla- 
tion dropped to r = .75 when the same two questions were presented in the same question- 
naire, attributed to the same researcher. In everyday discourse, the same questioner would 
not request the same information twice, in somewhat different words; hence, respondents 
differentiate between similar questions when they are presented by the same researcher. 
Two different researchers, on the other hand, may very well ask the same thing in differ- 
ent words, so identical answers are appropriate. 

Note that the repetition of very similar, if not identical, questions is a key feature of 
many real-time measurement procedures. At present, we do not know how this affects 
respondents’ question interpretation. Do respondents hesitate to repeat information at 
4:05 P.M. that they already provided at 3:40 P.M.? If they hesitate to provide the same 
answer, does their attempt to provide new information increase meaningful differen- 
tiation between episodes, or does it foster differentiations that go beyond respondents’ 
actual experience in situ? 


FORMAL CHARACTERISTICS OF QUESTIONS 


From a conversational perspective, every contribution is assumed to be related to the 
ongoing conversation, unless marked otherwise. In research settings, the researcher’s 
contributions include formal characteristics of the question, which respondents use in 
inferring the question’s pragmatic meaning (Schwarz, 1994, 1996). Suppose, for exam- 
ple, that respondents are asked how frequently they felt “really irritated” recently. Does 
this question refer to major or minor annoyances? The numeric values of the frequency 
scale provide relevant information. When the scale presents low frequencies, respondents 
infer that the researcher is interested in less frequent events than when it presents high 
frequencies; as a result, they report on major annoyances (which are relatively rare) in 
the former, but on minor annoyances in the latter case (Schwarz, Strack, Müller, & 
Chassein, 1988). The same logic applies to the length of reference periods (Winkielman, 
Knäuper, & Schwarz, 1998). Given that major annoyances are less frequent than minor 
annoyances, respondents infer that the question pertains to minor irritations when it is 
presented with a short reference period (e.g., “yesterday”), but to major annoyances when 
presented with a long reference period (e.g., “last 6 months”). Accordingly, questions 
with reference periods of differential length assess substantively different experiences 
(e.g., “minor” rather than “major” episodes of anger). 

This has potentially important implications for real-time measurement, which usually 
includes very short and recent reference periods. When asked at 4:05 P.M. how often 
they have been angry since the last measurement at 3:40 P.M., respondents may report 
on very minor episodes, which they would not consider worth mentioning for any lon- 
ger reference period. Moreover, once they assume that this is what the questioner has in 
mind, they may evaluate each minor episode relative to other minor episodes. Consistent 
with this shift in the frame of reference, they may then assign each minor episode a high 
intensity rating, leading the researcher to conclude that intense anger is very frequent. To 
date, these possibilities have not been addressed, and little is known about the potential 
impact of high-density measurement on question interpretation. 


Recall and Judgment 


Once respondents determine what the researcher is interested in, they need to recall rel- 
evant information to form a judgment. In some cases, they may have direct access to a 
previously formed relevant judgment they can offer as an answer. More likely, however, 
they will need to form a judgment when asked, taking the specifics of the question and the 
questioner’s inferred epistemic interest into account. The processes pertaining to different 
types of reports are discussed in the sections on behaviors, feelings, and attitudes. 


Formatting the Response 


Unless the question is asked in an open response format, respondents need to format 
their answer to fit the response alternatives provided by the researcher (for a review, see 
Schwarz & Hippler, 1991; Sudman et al., 1996). Respondents observe these question 
constraints and avoid answers that are not explicitly offered. Moreover, their selection of 
response alternatives is influenced by the order in which they are presented. In most cases, 
a given response alternative is more likely to be chosen when presented early rather than 
late on the list under visual presentation conditions, reflecting the sequence of reading. 
Conversely, a given alternative is more likely to be chosen when presented late rather than 
early on the list under auditory presentation conditions; respondents need to wait for the 
interviewer to finish reading the list and work backward, beginning with the last alterna- 
tive heard (Krosnick & Alwin, 1987; Sudman et al., 1996, Chapter 6). This suggests that 
real-time data capture through interactive voice responding, in which response alterna- 
tives are presented auditorily, may facilitate the emergence of recency effects, whereas the 
visual presentation formats typical for the experience sampling method (ESM) and daily 
diaries may facilitate primacy effects. 

Finally, respondents’ use of rating scales reflects two regularities that are familiar 
from psychophysical research; both have been conceptualized in Parducci’s (1965) range- 
frequency theory (see Daamen & de Bie, 1992, for social science examples). First, respon- 
dents use the most extreme stimuli to anchor the endpoints of the scale. Accordingly, they 
rate a given episode of anger as less intense when the high end of the scale is anchored by 
an extreme rather than a moderate anger episode. This has important implications for 
the comparability of retrospective and real-time reports. When asked to rate a single past 
episode, the recalled episode is likely to be compared to other memorable instances— 
which are often memorable because they were extreme. But when asked to rate multiple 
episodes over the course of a single day, previously rated moderate episodes may still be 
highly accessible. Hence, the same episode of anger may be rated as more extreme in 
real-time than in retrospective reports, reflecting the use of differentially extreme scale 
anchors and comparison standards. 

Second, psychophysical research further shows that respondents attempt to use all 
categories of the rating scale about equally often when the number of to-be-rated stimuli 
is large. Hence, two similar stimuli may receive notably different ratings when only a few 
stimuli are presented, but identical ratings when many stimuli have to be located along 
the same scale. In many real-time studies, respondents are asked to rate a large number 
of episodes along identical scales over the course of a few hours, which is likely to elicit 
similar shifts in ratings. Both of these regularities predict systematic differences between 
retrospective and concurrent ratings, as well as between concurrent ratings assessed with 
differential frequency. Future research may fruitfully test this possibility. 
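
The two regularities can be made concrete with a small numerical sketch. The Python code below is an illustration only, not the formulation given in the sources cited above. It assumes a common textbook reading of the range-frequency compromise, in which a rating is a weighted average of a stimulus's position within the range of the comparison context and its rank within that context. The function name, the weight of 0.5, the 7-point scale, and all intensity values are hypothetical.

```python
def range_frequency_rating(stimulus, context, w=0.5, scale_min=1, scale_max=7):
    """Rate `stimulus` relative to `context` (a list of intensities that includes it).

    Assumed model: judgment = w * range_value + (1 - w) * frequency_value.
    """
    lo, hi = min(context), max(context)
    # Range principle: position between the most extreme stimuli in the context.
    range_value = (stimulus - lo) / (hi - lo) if hi > lo else 0.5
    # Frequency principle: proportion of the context ranked below this stimulus.
    ordered = sorted(context)
    rank = ordered.index(stimulus)  # 0-based rank of the first occurrence
    frequency_value = rank / (len(ordered) - 1) if len(ordered) > 1 else 0.5
    judgment = w * range_value + (1 - w) * frequency_value
    return scale_min + judgment * (scale_max - scale_min)


# Hypothetical anger intensities on an arbitrary 0-100 metric.
target = 40
retrospective_context = [10, 40, 70, 95]   # includes memorable, extreme episodes
real_time_context = [10, 20, 30, 40, 50]   # recent, mostly moderate episodes

retrospective = range_frequency_rating(target, retrospective_context)
real_time = range_frequency_rating(target, real_time_context)
print(round(retrospective, 2), round(real_time, 2))
```

Under these assumptions the same episode receives a higher rating against the moderate real-time context than against the extreme retrospective context, which is the direction of the difference described above.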


Editing the Response: Social Desirability 


As the final step of the question answering sequence, respondents have to communicate 
their answer. Due to social desirability and self-presentation concerns, they may edit their 
response (see DeMaio, 1984, for a review). This is more likely in face-to-face interviews 
than under the more confidential conditions of self-administered questionnaires, with 
telephone interviews falling in between. This is good news for real-time data capture, 
which predominantly relies on self-administered formats. 

The literature further indicates that influences of social desirability are limited to 
potentially threatening questions and typically modest in size (DeMaio, 1984). Note, how- 
ever, that a behavior that may seem only mildly unfavorable when reported once for a 
single specific episode (e.g., “I don’t enjoy being with my spouse right now”) may become 
a major self-presentation concern when the same answer would need to be provided over 
several episodes. If so, high-density measurement in real-time studies may accentuate self- 
presentation concerns relative to retrospective reporting conditions through the cumulative 
impact of social desirability concerns over multiple similar episodes. Finally, respondents’ 
self-presentation concerns can be reliably reduced through techniques that ensure the ano- 
nymity and confidentiality of the answer (see Bradburn et al., 2004, for detailed advice). 


Reporting on Behaviors 


This section focuses on the recall stage of the question answering process, and highlights 
what respondents can and cannot remember and report. It is organized by the type of 
information the researcher wants to access. 


Historical Information 


Some questions pertain to historical information. Examples include “Have you ever had 
an episode of heartburn?” and “In what year did you first have an episode of heartburn?” 
Respondents’ memories are usually the only available source of information, and real- 
time measurement is not feasible. The best a researcher can do is to use interviewing tech- 
niques that take the structure of autobiographical memory into account to facilitate recall 
(for advice, see Belli, 1998; Schwarz & Oyserman, 2001; Tourangeau et al., 2000). 

Current models of autobiographical memory conceptualize it as a hierarchical network 
that includes extended periods (e.g., “the years I lived in New York”) at the highest 
level of the hierarchy. Nested within each extended period are lower-level extended events 
(e.g., “my first job” or “the time I was married to Lucy”). Further down the hierarchy are 
summarized events, which take the form of knowledge-like representations that lack epi- 
sodic detail (e.g., “During that time, I was frequently ill”). Specific events, like a particu- 
lar episode of illness, are represented at the lowest level of the hierarchy; to be represented 
at this level of specificity, the event has to be unique. As Belli (1998, p. 383) notes, this 
network, organized by time (“the years in New York”) and relatively global themes (“first 
job,” “first marriage,” “illness”), enables “the retrieval of past events through multiple 
pathways that work top-down in the hierarchy, sequentially within life themes that unify 
extended events, and in parallel across life themes that involve contemporaneous and 
sequential events.” Such searches take time and their outcome is somewhat haphazard, 
depending on the entry point into the network at which the search started. Building on 
these insights, event history calendars improve recall by using multiple entry points and 
forming connections across different periods and themes (see Belli, 1998, for a review 
and detailed advice). 

In the absence of such (costly) efforts, respondents are likely to apply extensive inference 
strategies to the few bits and pieces they remember to infer what “must have been” 
(Ross, 1989). Suppose, for example, that respondents are asked how much alcohol they 
drank 5 years ago. Having no direct access to this information, they are likely to consider 
their current alcohol consumption as a benchmark and to make adjustments if they see 
a need to do so. In most cases, their adjustments are insufficient because people assume 
an unrealistically high degree of stability in their behavior. This results in retrospective 
reports that are more similar to the present than warranted, as observed for reports of 
income (Withey, 1954); pain (Eich, Reeves, Jaeger, & Graff-Radford, 1985); or tobacco, 
marijuana, and alcohol consumption (Collins, Graham, Hansen, & Johnson, 1985). 
However, when respondents have reason to believe things were different in the past, they 
“remember” change (Ross, 1989), which is discussed next. 


Reports of Change, Covariation, and Causation 


Some questions go beyond mere retrospective reports and ask respondents to report on 
change over time (“Do you smoke more or less now than you did when you were 30?”) 
or to assess the covariation of their behavior with other variables (“Do you smoke more 
when you are stressed?”). Respondents can rarely retrieve the information needed to 
answer such questions and rely on extensive inference and estimation strategies to deter- 
mine what might have been. Their answers are useful to the extent that the underlying lay 
theories happen to be correct, which is usually unknown. 

Although most people assume an unwarranted amount of stability in their behavior, 
they readily detect change when their lay theory suggests that change must have occurred. 
This is particularly likely—and problematic—when the context suggests change, as is 
often the case in medical studies: Believing that things get better with treatment (or why 
else would one undergo it?), patients are likely to infer that their past condition must 
have been worse than their present condition (e.g., Linton, 1982; for a review, see Ross, 
1989). From a cognitive perspective, asking patients whether they feel better now than 
before their treatment is the most efficient way to “improve” the success rate of medical 
interventions, which may explain the recent popularity of “patient-reported outcomes.” 
Unfortunately, there is no substitute for appropriate study design. If change over time is 
of crucial interest, concurrent measures at different points in time are the only reliable 
way to assess it. 

Similar problems arise when respondents are asked to report on covariation (“Under 
which circumstances ... ?”) or causation (“Why ... ?”). To arrive at observation-based 
answers to these questions, respondents need to have an accurate representation of the 
frequency of their behaviors, the different contexts of these behaviors, and the intensity 
of related experiences. Respondents are often unable to provide accurate reports on any 
of these components, making their joint consideration an unrealistically complex task. 

Covariation and causation are best assessed with real-time data capture. ESMs excel 
at this task by prompting respondents to report on their behavior, experiences, and cir- 
cumstances, allowing researchers to collect all the data needed for appropriate analyses. 
However, an important caveat needs attention. While real-time or close-in-time measures 
improve the accurate assessment of covariation, causation, and change, respondents’ own 
behavioral decisions are based on their own perceptions, which may differ from reality. 
Hence, erroneous lay theories of covariation are often better predictors of behavior than 
accurate measures of covariation, as reviewed in the section on feelings. 
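
As a rough illustration of what such an analysis might look like, the following Python sketch computes the covariation between momentary stress and smoking for a single respondent. All of it is assumed for illustration: the record layout, the 1-5 stress rating, and the values themselves are invented, and a plain Pearson correlation stands in for whatever analysis a given study would actually use.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical momentary records from one respondent: each prompt captures
# current stress (1-5) and cigarettes smoked since the previous prompt.
records = [
    {"stress": 1, "cigarettes": 0},
    {"stress": 2, "cigarettes": 1},
    {"stress": 3, "cigarettes": 1},
    {"stress": 4, "cigarettes": 2},
    {"stress": 5, "cigarettes": 3},
]

stress = [rec["stress"] for rec in records]
cigarettes = [rec["cigarettes"] for rec in records]
r_within = pearson_r(stress, cigarettes)
print(round(r_within, 2))
```

The point of the sketch is only that the covariation is computed from concurrently recorded observations, so the respondent never has to answer “Do you smoke more when you are stressed?” from memory.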


Frequency Reports 


Frequency questions ask respondents to report on the frequency of a behavior or expe- 
rience during a specified reference period, often last week or last month. Researchers 
typically hope that respondents will identify the behavior of interest, search the refer- 
ence period, retrieve all instances that match the target behavior, and finally count these 
instances to determine the overall frequency of the behavior. However, such a recall-and- 
count strategy is rarely feasible. Respondents usually need to rely on extensive inference 
and estimation strategies to arrive at an answer; which strategy they use depends on the 
frequency, importance, and regularity of the behavior (e.g., Brown, 2002; Menon, 1993, 
1994; Sudman et al., 1996). 

Questions about rare and important behaviors can be answered on the basis of auto- 
biographical knowledge or a recall-and-count strategy. When asked “How often did you 
get divorced?” most people know the answer without extended memory search. When 
asked “How often did you relocate to another city?” many people do not know immedi- 
ately but can compute an appropriate answer by reviewing their educational and job his- 
tory, following a recall-and-count strategy. Respondents’ task is more demanding when 
the behavior is frequent. High frequency of a behavior makes it unlikely that detailed 
representations of numerous individual episodes are stored in memory; instead, different 
instances blend into one global, knowledge-like representation that lacks specific time or 
location markers (see Linton, 1982; Strube, 1987). Frequent doctor visits, for example, 
result in a well-developed knowledge structure for the general event, allowing respon- 
dents to report in considerable detail what usually goes on during their doctor visits. But 
the highly similar individual episodes become indistinguishable and irretrievable, mak- 
ing it difficult to report on any specific one. In these cases, respondents need to resort to 
estimation strategies to arrive at a plausible frequency report. Which estimation strategy 
they use depends on the regularity of the behavior and the context in which the frequency 
question is presented. 

When the behavior is highly regular, frequency estimates can be computed on the 
basis of rate information (Menon, 1994; Menon, Raghubir, & Schwarz, 1995). Respon- 
dents who go to church every Sunday have little difficulty in arriving at a weekly or 
monthly estimate. However, exceptions are likely to be missed, and the estimates are 
accurate only when exceptions are rare. A related strategy relies on extrapolation from 
partial recall. When asked how often she took pain medication during the last week, for 
example, a respondent may reason, “I took pain killers three times today, but this was a 
bad day. So probably twice a day, times 7 days, makes 14 times a week.” The accuracy 
of this estimate depends on the accuracy of the underlying assumptions, the regularity of 
the behavior, and the day that served as input into the chain of inferences. 
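
The respondent's chain of inferences can be written out as simple arithmetic. The Python sketch below reproduces the example from the text (three times today, corrected to twice a day, times 7 days) and compares it with a “true” week that is invented purely for illustration. The function and parameter names are hypothetical.

```python
def extrapolate_weekly(count_on_recalled_day, adjustment=0, days=7):
    """Extrapolate a weekly frequency from one recalled day.

    `adjustment` is the respondent's own correction for an atypical day.
    """
    assumed_daily_rate = count_on_recalled_day + adjustment
    return assumed_daily_rate * days


# The reasoning in the text: three times today, but a bad day,
# so probably twice a day.
estimate = extrapolate_weekly(count_on_recalled_day=3, adjustment=-1)

# Hypothetical actual intake per day, invented for illustration only.
actual_week = [3, 1, 0, 2, 1, 0, 1]
error = estimate - sum(actual_week)
print(estimate, sum(actual_week), error)
```

With these made-up numbers the estimate of 14 overshoots the actual total, because the recalled day was unrepresentative and the respondent's adjustment was too small. A different input day would produce a different error.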

Other estimation strategies may even bypass any effort to recall specific episodes. For 
example, respondents may simply rely on information provided by the research instru- 
ment itself. As an example, consider the frequency scales shown in Table 2.1. Consistent 
with the maxims of cooperative conversational conduct (Grice, 1975), respondents assume 
that the researcher constructed a meaningful scale that is relevant to their task (Schwarz, 
1996). Presumably, the range of response alternatives reflects the researcher’s knowledge 
about the distribution of the behavior, with values in the middle range of the scale cor- 
responding to the “usual” or “average” behavior, and values at the extremes of the scale 
corresponding to the extremes of the distribution. Drawing on these assumptions, respon- 
dents use the range of the response alternatives as a frame of reference in estimating their 
own behavioral frequency. This results in higher frequency estimates when the scale pre- 
sents high- rather than low-frequency response alternatives, as Table 2.1 illustrates. 

Such scale-based estimation effects have been observed for a wide range of behaviors 
(for a review, see Schwarz, 1996); they are more pronounced the more poorly the respective 
behavior is represented in memory (Menon et al., 1995). When behaviors of differential 
memorability are assessed, the differential influence of response alternatives can either 
exaggerate or cloud actual differences in the relative frequency of the behaviors, under- 
mining comparisons across behaviors. Moreover, respondents with poorer memory are 
more likely to be influenced by frequency scales than are respondents with better mem- 
ory (e.g., Knäuper, Schwarz, & Park, 2004), which can undermine comparisons across 
groups. Finally, frequency scales also invite systematic underestimates of the variance in 
behavioral frequencies because all respondents draw on the same frame of reference in 
computing an estimate, resulting in reports that are more similar than reality warrants. 


TABLE 2.1. Reported Daily TV Consumption as a Function 
of Response Alternatives 


Low-frequency alternatives          High-frequency alternatives 

Up to ½ hour              7.4%      Up to 2½ hours           62.5% 
½ hour to 1 hour         17.7%      2½ hours to 3 hours      23.4% 
1 hour to 1½ hours       26.5%      3 hours to 3½ hours       7.8% 
1½ hours to 2 hours      14.7%      3½ hours to 4 hours       4.7% 
2 hours to 2½ hours      17.7%      4 hours to 4½ hours       1.6% 
More than 2½ hours       16.2%      More than 4½ hours        0.0% 


Note. N = 132. Adapted from Schwarz, Hippler, Deutsch, and Strack (1985). 
Reprinted by permission of Oxford University Press. 
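
The size of the scale effect in Table 2.1 is easiest to see when both scales are collapsed at the one boundary they share, 2½ hours. The short Python sketch below does this with the percentages from the table. The variable names are hypothetical, and the comparison is a reading of the table rather than an analysis reported in the source.

```python
# Percentages from Table 2.1, listed from the lowest to the highest category.
low_scale = [7.4, 17.7, 26.5, 14.7, 17.7, 16.2]   # top category: "More than 2½ hours"
high_scale = [62.5, 23.4, 7.8, 4.7, 1.6, 0.0]     # bottom category: "Up to 2½ hours"

# Share of respondents reporting more than 2½ hours of daily TV under each scale.
more_than_2_5h_low = low_scale[-1]
more_than_2_5h_high = round(sum(high_scale[1:]), 1)

print(more_than_2_5h_low, more_than_2_5h_high)
```

Read this way, 16.2% of respondents reported more than 2½ hours on the low-frequency scale, compared with 37.5% on the high-frequency scale.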


Self-Reports of Feelings 


Feelings are subjective phenomena to which the person who has them has privileged 
access. While this does not imply that feelings are always easy to identify for the expe- 
riencer (for a discussion of different types of feelings and the underlying appraisal pro- 
cesses, see Clore, Schwarz, & Conway, 1994; Ellsworth & Scherer, 2003), most research- 
ers consider the experiencer the final arbiter of his or her own feeling. Unfortunately, that 
final arbiter is likely to tell us different things at different points in time, and numerous 
studies have documented profound discrepancies between people’s concurrent and retrospective 
reports of emotions (for a comprehensive review, see Robinson & Clore, 2002). 
This section reviews why this is the case, presents some illustrative biases, and highlights 
distinct patterns of convergence and divergence among prospective, concurrent, and ret- 
rospective reports, as well as the choices people make (for further discussion of emotion 
measurement, see Augustine & Larsen, Chapter 27, this volume). 


Accessibility Model of Emotion Report 


To conceptualize the processes underlying emotion reports, Robinson and Clore (2002) 
proposed an accessibility model. When people report on their current feelings, the feel- 
ings themselves are accessible to introspection, allowing for accurate reports on the basis 
of experiential information. But affective experiences are fleeting and not available to 
introspection once the feeling dissipates. Accordingly, the opportunity to collect emotion 
reports based on introspective access is limited to methods of real-time data capture, 
such as experience sampling (Stone, Shiffman, & De Vries, 1999; see also Augustine 
& Larsen, Chapter 27, this volume). Once the feeling dissipates, the affective experi- 
ences need to be reconstructed on the basis of episodic or semantic information. When 
the report pertains to a specific recent episode, people can draw on episodic memory, 
retrieving specific moments and details of the recent past. Detailed episodic recall can 
often reelicit a similar feeling (and is therefore a popular mood manipulation); it can also 
provide sufficient material for relatively accurate reconstruction. Hence, episodic reports 
often recover the actual experience with some accuracy, as indicated by convergence with 
concurrent reports (e.g., Kahneman et al., 2004; Robinson & Clore, 2002; Stone et al., 
2006). One method that facilitates episodic reporting is the day reconstruction method 
(DRM; Kahneman et al., 2004, 2006), discussed below. At present, it remains unclear 
how far in the past an episode can be and still allow reasonably accurate episodic recon- 
struction. Most likely the answer depends on the uniqueness and memorability of the 
episode, paralleling the earlier discussion of behavioral frequency reports. 

In contrast to episodic reports, global reports of past feelings are based on seman- 
tic knowledge. When asked how they “usually” feel during a particular activity, people 
draw on their general beliefs about the activity and its attributes to arrive at a report. The 
actual experience does not figure prominently in global reports because the experience 
itself is no longer accessible to introspection, and episodic reconstruction is not used to 
answer a global question. 

Extending this accessibility model of emotion report, Schwarz and colleagues (2009; 
Xu & Schwarz, 2009) noted that the same semantic knowledge serves as a basis for 
predicting future feelings, for which episodic information is not available. Such predic- 
tions are usually more extreme than people’s actual experiences (for a review, see Wilson 
& Gilbert, 2003) because the predictor focuses on core attributes of the activity at the 
expense of other information, resulting in a “focusing illusion” (Schkade & Kahneman, 
1997). For example, Midwesterners who predict how happy they would be if they moved 
to California may focus on the pleasant California climate, missing, for example, that 
they would still have to spend most of the day in an office cubicle. Finally, hedonic predic- 
tions play an important role in people’s daily lives because they serve as input into choice 
(March, 1978; Mellers & McGraw, 2001) and influence which course of action people 
will or will not take. 


Convergences and Divergences 


The previous rationale predicts a systematic pattern of convergences and divergences, 
which results directly from the inputs on which the respective reports are based. First, 
concurrent reports and retrospective reports pertaining to a specific and recent episode 
are likely to show good convergence, provided that the episode is sufficiently recent to 
allow detailed and vivid reinstantiation in episodic memory. Second, retrospective global 
reports of past feelings and predictions of future feelings are also likely to converge given 
that both are based on the same semantic inputs. Third, choices are based on predicted 
hedonic consequences, and hence converge with predictions and global memories. One 
unfortunate side effect of these convergences is that people’s global memories seem to 
“confirm” the accuracy of their predictions and the wisdom of their choices, thus impair- 
ing the opportunity to learn from experience (Schwarz & Xu, 2011). Fourth, concurrent 
and episodic reports, however, often diverge from prediction, choice, and global memories. 
As a result, different measures can paint very different pictures of a person’s affective 
experience with the same situation, as a few examples may illustrate (see Schwarz et al., 
2009, for a review). 


HOW ENJOYABLE ARE VACATIONS? 


Not surprisingly, people believe that vacations are very enjoyable, and this belief shapes 
their predictions, choices, and global memories, even when their actual recent experi- 
ence was less rosy. Assessing prospective, concurrent, and retrospective reports of vaca- 
tion enjoyment, Mitchell, Thompson, Peterson, and Cronk (1997) found that prospective 
reports converged with retrospective reports; however, both the predicted and remembered 
affect were more positive than the affect reported concurrently during the vacation. 
In a later study, Wirtz, Kruger, Scollon, and Diener (2003) tracked college students 
before, during, and after their spring break vacations and compared their predicted, con- 
current, and remembered affect. They found that predicted and remembered experiences 
were more intense (i.e., both more positive and more negative) than those reported con- 
currently during the vacation. However, the (biased) remembered experience predicted 
the desire to repeat the vacation better than did the actual experience, illustrating that we 
learn from our memories, not from our experiences. 


HOW BAD WAS THAT COLONOSCOPY? 


Particularly memorable examples of learning from memory rather than experience have 
been reported in the medical domain. For example, Redelmeier and Kahneman (1996) 
observed that retrospective evaluations of pain are dominated by two moments that may
be of particular adaptive relevance (Fredrickson, 2000): the peak (“How bad does it
get?”) and the end (“How does it end?”). Other aspects, such as the overall duration of 
pain, exert little influence. In fact, extending the duration of a colonoscopy by adding a
few moments of milder discomfort at the end improves the overall evaluation of the episode by
adding a better ending. It also improves the likelihood of future compliance, again high- 
lighting how memory beats experience in predicting future behavior (Redelmeier, Katz, 
& Kahneman, 2003). 
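
The peak-end pattern can be made concrete with a small numerical sketch. The pain ratings below are invented for illustration, and scoring remembered pain as the mean of the worst and the final moment is a common simplification of the peak-end idea, not the analysis reported in the studies cited above.

```python
def total_pain(ratings):
    """Sum of moment-by-moment ratings: a rough proxy for experienced pain."""
    return sum(ratings)


def peak_end_score(ratings):
    """Mean of the worst and the final moment: a rough proxy for remembered pain."""
    return (max(ratings) + ratings[-1]) / 2


# Hypothetical minute-by-minute pain ratings on a 0-10 scale.
standard = [2, 5, 8, 7]
# The same procedure, extended by three minutes of milder discomfort.
extended = standard + [4, 3, 2]

for name, ratings in (("standard", standard), ("extended", extended)):
    print(name, total_pain(ratings), peak_end_score(ratings))
# standard: total 22, peak-end 7.5
# extended: total 31, peak-end 5.0
```

Under these assumed numbers the extended procedure involves more pain in total, yet its peak-end score is lower, which is the direction of the effect described in the text.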


HOW MUCH DO PARENTS ENJOY SPENDING TIME WITH THEIR CHILDREN? 


Several decades ago, Juster (1985) asked a representative sample of Americans to rate 28 
activities from dislike very much (0) to enjoy a great deal (10). The results showed that activities
with one’s children consistently topped the list (ranked 1-4), whereas grocery shopping 
and cleaning the house were reported as least enjoyable (ranked 27 and 28; Juster, 1985, 
p. 336). In stark contrast to these reports, other studies indicate that parents’ marital 
satisfaction drops when children arrive, reaches a lifetime low when the children are 
teenagers, and recovers after the children leave the house (for a review, see Argyle, 1999). 
Are the children a more mixed pleasure than global reports of enjoyment convey? Close- 
in-time measures of affective experience, collected with the DRM, suggest so. Specifi- 
cally, 909 employed women in Texas recalled their activities during the preceding day 
and reported how they felt during each specific episode (Kahneman et al., 2004). In these 
episodic reports, activities coded as “taking care of my children” ranked just above the 
least enjoyable activities of the day, namely, working, housework, and commuting; data 
from other samples replicated this pattern. 

Several processes contribute to this divergence between global and episodic reports. 
First, global judgments of enjoyment are based on general beliefs (“I enjoy my kids”), 
which are often supported by belief-consistent memories of great vividness (e.g., fond 
memories of shared activities). Yet most mundane episodes of a given day are less enjoy- 
able than the episodes that make for fond memories. Second, activities are organized in 
memory by their focal features. Attempts to recall memories pertaining to one’s interac- 
tions with one’s children therefore result in an overrepresentation of child-focused activi- 
ties, at the expense of numerous other episodes of the day in which the children were 
present. The reconstruction of a whole day in the DRM avoids many of these problems 
of selective recall and provides a fuller assessment of the affective impact of children 
throughout the day. Hence, the findings suggest that part of the reason children seem 
more enjoyable in general than on any given day is simply that parents do not consider 
the full range of child-related time use when providing global reports. Finally, global 
reports are subject to higher social desirability pressures than episodic reports. A parent 
who reports, “I don’t enjoy spending time with my children” is clearly a bad parent, but 
noting that “they were a pain last night” is perfectly legitimate. 


IMPLICATIONS 


Several methodological implications are worth emphasizing. Researchers who want 
to assess people’s actual hedonic experiences should preferably do so with concurrent 
reports, using experience sampling methodologies (Stone et al., 1999). If this is not fea- 
sible, episodic reporting methods, such as the DRM (Kahneman et al., 2004), provide 
a less burdensome alternative that can capture the experience with some accuracy,
provided that the relevant episodes are recent. In contrast, global reports of affect are
theory-driven, not experience-driven. They capture respondents’ beliefs about their experience
rather than the experience itself and are subject to pronounced focusing effects. 

However, people’s behavioral choices are based on their expected hedonic conse- 
quences (March, 1978). These expectations converge with global memories but often 
diverge from actual experience. Hence, predictions, choices, and global memories are 
poor indicators of experience. Yet when people make behavioral decisions, global memo- 
ries and expectations are likely to figure prominently in the information they consider. 
Ironically, this turns faulty indicators of experience into good predictors of future choices 
and behaviors (e.g., Wirtz et al., 2003). It also suggests that optimizing a study for accu- 
rate description of what people do and feel does not optimize it for accurate prediction 
of what they will do next (and vice versa): Description and prediction are different goals, 
and their optimization requires different strategies. 


An Example of Episodic Reconstruction: The Day Reconstruction Method 


The DRM (Kahneman et al., 2004) is designed to collect data that describe a person’s 
time use and affect on a given day through a systematic reconstruction on the following 
day. In a self-administered questionnaire, respondents first reinstantiate the previous day 
into working memory by producing a short diary consisting of a sequence of episodes, 
usually covering the time they got up to the time they went to bed. The diary’s format 
draws on insights from cognitive research with event history calendars (Belli, 1998) and 
facilitates retrieval from autobiographical memory through multiple pathways. Its epi- 
sodic reinstantiation format attenuates biases commonly observed in retrospective reports 
(Robinson & Clore, 2002; Schwarz & Sudman, 1994). Respondents’ diary entries are 
confidential, and the diary is not returned to the researcher, which allows respondents to 
use idiosyncratic notes, including details they may not want to share. 

Next, respondents draw on their diaries to answer a series of questions about each 
episode, including (1) when the episode began and ended, thus providing time use data; 
(2) what they were doing; (3) where they were; (4) with whom they were interacting; and 
(5) how they felt, assessed on multiple affect dimensions. The details of this response 
form can be tailored to the specific issues under study; only this form is returned to 
the researcher for analysis. For methodological reasons, it is important that respondents 
complete the diary before they are aware of the specific content of the later questions 
about each episode. Early knowledge of these questions may affect the reconstruction 
of the previous day and may introduce selection bias. The DRM can be administered 
individually or in group settings, and respondents can report on a complete day in 45-75 
minutes. DRM reports have been validated against experience sampling data, and Krue- 
ger, Kahneman, Schkade, Schwarz, and Stone (2009) provide a comprehensive review of 
the methodology and available findings. 
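
The structure of the response form can be sketched as a simple record, one per episode. Everything in the sketch is hypothetical: the field names, the 0-6 rating scale, the example day, and the choice to summarize a day by weighting each episode's rating by its duration are assumptions made for illustration, not a specification of the published instrument.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Episode:
    """One row of a hypothetical DRM response form."""
    start: datetime      # (1) when the episode began
    end: datetime        #     and ended
    activity: str        # (2) what the respondent was doing
    location: str        # (3) where the respondent was
    companions: list     # (4) with whom the respondent was interacting
    affect: dict         # (5) ratings per affect dimension, e.g. {"happy": 4}

    @property
    def minutes(self):
        return (self.end - self.start).total_seconds() / 60


def duration_weighted_mean(episodes, dimension):
    """Average one affect dimension, weighting each episode by its length."""
    total_minutes = sum(e.minutes for e in episodes)
    weighted = sum(e.minutes * e.affect[dimension] for e in episodes)
    return weighted / total_minutes


day = [
    Episode(datetime(2024, 1, 8, 7, 0), datetime(2024, 1, 8, 8, 0),
            "commuting", "car", [], {"happy": 2}),
    Episode(datetime(2024, 1, 8, 18, 0), datetime(2024, 1, 8, 18, 30),
            "dinner", "home", ["family"], {"happy": 5}),
]
print(duration_weighted_mean(day, "happy"))  # (60*2 + 30*5) / 90 = 3.0
```

Because start and end times are recorded for every episode, the same records serve as time use data and as the basis for affect summaries.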


Self-Reports of Attitudes: Evaluation in Context 


Another common type of self-report question asks people to report on their likes and 
dislikes. Psychologists commonly assume that these reports reflect a predisposition to 
evaluate some object in a favorable or unfavorable manner; this predisposition is referred 
to as an attitude (Eagly & Chaiken, 1993, 2005). Attitudes are hypothetical constructs
that cannot be directly observed and need to be inferred from individuals’ responses
to the attitude object. As Allport (1935, p. 836) put it, “How does one know that atti- 
tudes exist at all? Only by necessary inference. There must be something to account for 
the consistency of conduct” (italics added). From this perspective, it is not surprising 
that attitude questions are often asked without reference to any specific context—what 
makes the construct appealing is exactly the promise of predictive power across con- 
texts. Empirically, attitude research never delivered on this promise. In an early review 
of attitude-behavior consistency, Wicker (1969, p. 65) concluded that “only rarely can 
as much as 10% of the variance in overt behavioral measures be accounted for by atti- 
tudinal data.” Even the attitude reports themselves proved highly malleable, and minor 
variations in question wording, question order, or response format can elicit profound 
shifts in reported attitudes, even on familiar and important topics (for early examples, see 
Cantril, 1944; Payne, 1951; for reviews, see Schuman & Presser, 1981; Schwarz, 1999; 
Schwarz, Groves, & Schuman, 1998). Attempts to overcome these disappointments took 
one of two general paths: one focused on improving measurement of the attitude itself, 
and the other on improving the predictive power of the attitude measure by taking con- 
text variables into account. 


Stalking the “True” Attitude 


Mirroring Campbell’s (1981) convictions, many researchers assumed that context effects 
on attitude reports and low attitude-behavior consistency can be traced to participants’ 
hesitation to report their true feelings “with candor and accuracy.” This focused efforts 
on attempts to reduce respondents’ self-presentation concerns (e.g., techniques that ensure 
respondents’ anonymity and the confidentiality of their answers; for recommendations, 
see Bradburn et al., 2004) or to convince them that “lying” was futile—thanks to sophis- 
ticated machinery, the researcher would learn their “true” attitude anyway (e.g., Jones 
and Sigall’s [1971] “bogus pipeline”). Empirically, such techniques have been found to 
increase the frequency of socially undesirable answers. For example, people are more 
likely to admit that they enjoy pornography when they cannot be identified as the source 
of the answer (Himmelfarb & Lickteig, 1982), and white participants are more likely to 
report that they dislike African Americans under bogus pipeline conditions (e.g., Allen, 
1975). However, external validation of the reports is not available, and the procedures 
themselves may invite correction of one’s spontaneous answer in light of the concern 
about bias that is clearly conveyed. 

Whereas these developments assumed that people know their own attitudes but may 
not want to report them, later developments considered the possibility that people may 
sometimes not be aware of their own attitudes, or may not even want to “admit” them to 
themselves. Implicit measures of attitudes address this possibility (for overviews, see the 
contributions in Wittenbrink & Schwarz, 2007). These procedures range from evaluative 
and conceptual priming techniques (for a review, see Wittenbrink, 2007) and response 
competition procedures (e.g., the Implicit Association Test [IAT]; for a review, see Lane, 
Banaji, Nosek, & Greenwald, 2007) to low-tech paper-and-pencil measures (e.g., word 
completion tasks; for a review, see Vargas, Sekaquaptewa, & von Hippel, 2007). To 
many researchers’ disappointment, implicit measures did not deliver the robust, context- 
independent assessment of attitudes for which theorists have long hoped. To the con- 
trary, implicit measures of attitudes are subject to the same context effects that have been 
observed with explicit self-reports (for extensive reviews, see Blair, 2002; Ferguson &
Bargh, 2007). For example, Dasgupta and Greenwald (2001) found that exposure to
pictures of liked African Americans and disliked European Americans resulted in shifts on
a subsequent IAT that paralleled previously observed effects of exposure to liked or dis- 
liked exemplars on explicit measures of attitudes (e.g., Bodenhausen, Schwarz, Bless, & 
Wänke, 1995). Similarly, Wittenbrink, Judd, and Park (2001) found that the same black
face primes elicited more negative automatic responses when the faces were presented on 
the background of an urban street scene rather than a church scene. Moreover, automatic 
evaluations have also been obtained for novel objects for which no previously acquired 
object-attitude links could have been stored in memory (e.g., Duckworth, Bargh, Garcia, 
& Chaiken, 2002). 

Such findings make it unlikely that implicit measures provide a “bona fide pipeline” 
(Fazio, Jackson, Dunton, & Williams, 1995) to people’s true and enduring attitudes, 
formed on the basis of past experience and stored in memory as object-evaluation asso- 
ciations. However, the findings are fully compatible with an alternative conceptualization 
of attitudes as evaluations in context (for variants on this theme, see Ferguson & Bargh, 
2007; Lord & Lepper, 1999; Schwarz, 2007). 


Attitude Construal: Evaluation in Context 


As James (1890, p. 333) observed, “My thinking is first and last and always for the sake 
of my doing.” Few psychologists doubt this truism, but even fewer heed its implications. 
To serve action in a given context, any adaptive system of evaluation should be informed 
by past experience but highly sensitive to the specifics of the present. It should overweigh 
recent experience at the expense of more distant experience, and experience from similar 
situations at the expense of experience from dissimilar situations. In addition, it should 
take current goals and concerns into account to ensure that the assessment is relevant to 
what we attempt to do now, in this context. In short, only context-sensitive evaluation 
can guide behavior in adaptive ways by alerting us to problems and opportunities when 
they exist; by interrupting ongoing processes when needed (but not otherwise); and by 
rendering information highly accessible that is relevant now, in this situation. From this 
perspective, it is no coincidence that any list of desirable context sensitivities reads like a 
list of the conditions that give rise to context effects in attitude judgment (Schwarz, 1999, 
2007). 

Close attention to context also improves the predictive value of attitude reports, 
as reflected in increased attitude-behavior consistency. This was first highlighted in the 
seminal work of Fishbein and Ajzen (1975), who considered it a measurement issue, not 
a conceptual issue. However, the underlying principle follows directly from attitude con- 
strual models: An evaluation reported at Time 1 will map onto an evaluation or behav- 
ioral decision at Time 2 to the extent that the person draws on the same inputs at both 
points in time. This matching principle (Lord & Lepper, 1999) offers a coherent con- 
ceptualization of the conditions of stability, as well as change in attitude reports, and 
predicts when attitude judgments will or will not relate to later behavioral decisions (for 
reviews, see Lord & Lepper, 1999; Schwarz, 2007). Numerous variables—from the per- 
son’s current goals to the nature of the context and the frequency and recency of previous 
exposure—can influence the temporary construal of the attitude object and hence the 
observed consistencies and inconsistencies across time and contexts.


TAKING THE ACTOR’S PERSPECTIVE


Construal models of attitudes are compatible with broader current developments in psy- 
chological science, most notably our increasing understanding of the situated and embod- 
ied nature of cognition, emotion, and motivation (for recent reviews, see Barsalou, 2005; 
Niedenthal, Barsalou, Winkielman, Krauth-Gruber, & Ric, 2006; and the contributions 
in Mesquita et al., 2010). But much as social psychologists would expect, construal mod- 
els lack the intuitive appeal of dispositional attitude models. After all, the logic of dis- 
positional models is fully compatible with observers’ robust preference for dispositional 
rather than situational explanations, also known as the fundamental attribution error 
(Ross, 1977). In contrast, construal models emphasize the role of contextual variables, 
which are usually more attended to by the actor (Jones & Nisbett, 1971), who benefits 
from the context sensitivity of evaluation in situ. From this perspective, Allport’s (1935) 
hope that enduring attitudes can account for an actor’s “consistency of conduct” in the 
present is an observer’s dream but an actor’s nightmare. After decades of conducting 
attitude research predominantly from the perspective of an observer who tries to predict 
an actor’s behavior, the increasing interest in how people live and experience their lives 
on a moment-to-moment basis may contribute to a more systematic exploration of evalu- 
ation in context from an actor’s perspective (see also Mehl & Robbins, Chapter 10, this 
volume). 


Coda: Questions In Situ 


As this selective discussion of the complexities of self-report indicates, retrospective ques- 
tions often ask respondents for information they cannot provide with any validity, as 
discussed in the sections on self-reports of behaviors and feelings. Other questions ask 
for generic answers that may be incompatible with the contextualized and situated nature 
of human experience. In the case of attitude measurement, much of the appeal of the 
enterprise rests on the hope of predicting behavior across contexts, leading researchers 
to discount the context sensitivity of evaluative judgment as undesirable noise. Methods 
of real-time or close-in-time measurement attenuate these problems by assessing infor- 
mation in situ, thus allowing (at least potentially) for the simultaneous assessment of 
contextual and experiential variables, and by posing more realistic tasks in the form of 
questions about respondents’ current behavior, experiences, and circumstances. These 
are promising steps. 

At the same time, asking questions in situ raises new self-report issues, which so far 
have received limited attention. Central to these new issues is the high density of most 
real-time data-capture procedures, which require respondents to answer the same ques- 
tions multiple times within a relatively short time. As noted in the section on question 
comprehension, this introduces conversational issues of nonredundancy (Grice, 1975; 
Schwarz, 1994) that may invite an emphasis on what is unique and new in each episode 
at the expense of attention to what is shared across episodes and has therefore already 
been reported earlier, making its repetition a violation of conversational norms. Simi- 
larly, rating many episodes along the same scale invites attention to the frequency prin- 
ciple (Parducci, 1965) of rating scale use, eliciting differentiation in the reports that may 
exceed differences in experience. Moreover, repeated ratings make it likely that previous
related episodes are still accessible and serve as scale anchors or comparison standards.
In most cases, these recent anchors would be less extreme than the “memorable” events 
used to anchor rating scales in one-time ratings. If so, a given episode would be rated as 
more intense in real-time assessment, where it is evaluated against a less extreme anchor, 
than in retrospective assessment, where it is evaluated against a more distant “memo- 
rable” episode. The cognitive and communicative processes underlying real-time self- 
reports require the systematic exploration and experimentation that has advanced the 
understanding of self-reports in other domains (Schwarz & Sudman, 1996; Sudman et 
al., 1996). Without such work, we run the risk of merely replacing known biases with 
unknown ones. 

Finally, advocates of real-time measurement probably do not appreciate the conclu- 
sion that accurate assessments of real-time experience are poorer predictors of future 
behavioral choices than faulty memories of the same experience (e.g., Kahneman, 
Fredrickson, Schreiber, & Redelmeier, 1993; Redelmeier et al., 2003; Wirtz et al., 2003). 
As one reviewer of this chapter put it, “Why should we even bother measuring experi- 
ence if global or retrospective assessments are the ‘better’ predictors of choice?” The 
answer is simple: There is more to behavioral science than the observer’s desire to predict 
others’ choices. A full understanding of the human experience requires attention to the 
actor’s perspective and insight into how people live and experience their lives. Real-time 
measurement in situ is ideally suited to illuminate the dynamics of human experience 
from the actor’s perspective, balancing decades of research that privileged the observer’s 
goals.


CHAPTER 3


Why Researchers Should Think 
“Within-Person” 


A Paradigmatic Rationale 


ELLEN L. HAMAKER 


Most psychological research is based on analyzing (relatively) large samples such that
the results can be generalized to the population from which the sample is taken.
Allport (1946) used the term nomothetic for this kind of research, pitting it against
idiographic research, which is concerned with the study of a particular individual. When
Windelband introduced these terms in 1894, however, nomothetic research implied a 
search for general laws, such as is customary in the natural sciences, whereas idiographic 
research implied a focus on particularities, such as is done in the humanities (Windel- 
band, 1894/1980). Hence, by linking the term nomothetic to the large-sample approach, 
Allport—unintentionally—contributed to the now widespread belief that the large-sample 
approach is concerned with the search for general laws. For instance, Kenny, Kashy, and
Cook (2006) state in the introduction of their book on dyadic research: “In nomothetic 
analysis, research is conducted across many dyads, and the focus is on establishing gen- 
eral laws that apply to all dyads of a similar nature” (p. 10). 

Despite the commonplaceness of this belief, it has been pointed out repeatedly that 
the large-sample approach is not actually concerned with finding general laws that apply 
to each and every individual in the population, or even to a majority of the individuals 
in a population (cf. Grice, 2004; Krauss, 2008; Lamiell, 1990, 1997, 1998). Rather, it 
is concerned with the distribution of variables at the level of the population through the 
use of summary statistics such as means, variances, proportions, and correlations. These 
summary statistics describe characteristics of the sample and help to determine what 
applies to the aggregate—that is, when averaging across individuals (Lamiell, 1998). 
However, what applies in aggregate is not necessarily informative about what is true in
general, where the latter implies it is true for each and every individual in the population 
(Baldwin, 1946; Cattell, 1966; Danziger, 1990; Epstein, 1980; Grice, 2004; Hamaker, 
Dolan, & Molenaar, 2005; John, 1990; Krauss, 2008; Lamiell, 1990, 1997, 1998;
Nesselroade, 2001, 2002; Skinner, 1938).

To illustrate what may happen when we try to generalize the results obtained for
the aggregate to the general, consider the following example. Suppose we are interested 
in the relationship between typing speed (i.e., number of words typed per minute) and 
the percentage of typos that are made. If we look at the cross-sectional relationship (i.e., 
the population level), we may well find a negative relationship, in that people who type 
faster make fewer mistakes (this may be reflective of the fact that people with better 
developed typing skills and more experience both type faster and make fewer mistakes). 
This negative relationship across people is depicted in the left panel of Figure 3.1. If we 
were to generalize this result to the within-person level, we would conclude that if a par- 
ticular person types faster, he or she will make fewer mistakes. Clearly, this is not what 
we expect: In fact, we are fairly certain that for any particular individual, the number of 
typos will increase if he or she tries to type faster. This implies a positive—rather than 
a negative—relationship at the within-person level. In the right panel of Figure 3.1, the 
within-person relationship between these two variables is depicted for a number of indi- 
viduals. It also shows that because the individual means differ in the opposite direction 
and the variability in individual means is much larger than the variability within persons, 
the standard large-sample approach based on taking a cross section will result in a nega- 
tive relationship (see also Nezlek, 2001). 
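
The typing example can be reproduced with simulated data. The sketch below is illustrative only: the number of persons and occasions, the effect sizes, and the noise levels are arbitrary assumptions chosen so that between-person differences in skill are much larger than within-person fluctuations, as the example supposes.

```python
import random


def pearson(xs, ys):
    """Plain product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is repeatable
people = []
for _ in range(50):                      # 50 hypothetical typists
    skill = rng.uniform(0, 1)
    mean_speed = 30 + 50 * skill         # skilled typists are faster on average
    mean_typos = 8 - 6 * skill           # and make fewer typos on average
    speeds, typos = [], []
    for _ in range(20):                  # 20 occasions per person
        push = rng.gauss(0, 3)           # typing faster or slower than usual
        speeds.append(mean_speed + push)
        # Within a person, pushing speed up raises the typo percentage.
        typos.append(mean_typos + 0.1 * push + rng.gauss(0, 0.1))
    people.append((speeds, typos))

# Cross-sectional analysis: one occasion per person.
cross_sectional = pearson([s[0] for s, _ in people], [t[0] for _, t in people])
# Within-person analysis: one correlation per person across occasions.
within_person = [pearson(s, t) for s, t in people]

print("cross-sectional r:", round(cross_sectional, 2))
print("smallest within-person r:", round(min(within_person), 2))
```

With these settings the cross-sectional correlation is negative while every within-person correlation is positive, so the sign of the aggregate relationship says nothing about the sign of the relationship within any one person.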

In the previous example, everybody is characterized by the same within-person rela- 
tionship between the percentage of typos and typing speed, and this can thus be inter- 
preted as a general law, which deviates from the relationship found cross-sectionally. 
However, it is also possible that within-person relationships are characterized by indi- 
vidual differences such that no general law applies to each person. Either way, the bottom 
line is that studying large representative samples—as is common practice in psychologi- 
cal research—does not provide us with information on general laws. Hence, qualifying 
large-sample research as nomothetic, and thus—indirectly—as more scientific than other 
approaches because it is concerned with finding general laws, is erroneous. 


[Figure 3.1: two panels, titled “Cross-sectionally” (left) and “In general” (right); the axes of both panels are labeled percentage of typos and number of words per minute.]


FIGURE 3.1. Left: The cross-sectional relationship between typing speed and percentage of
typos. Right: The within-person relationship for a number of persons.

More and more researchers are becoming aware of this as they acknowledge the
limitations of the large-sample approach for the study of processes that unfold within 
individuals over time. To really study the latter, we need to observe the process as it is 
taking place, using intensive sampling methods over time. In addition, it requires us to 
think about new ways of analyzing such data, ideally at the level where it is taking place, 
within the person. This handbook forms a timely and valuable reaction to this growing 
awareness: Not only does it include extensive information on how to collect intensive 
longitudinal data but it also discusses diverse techniques for analyzing such data,
emphasizing the different aspects on which one may wish to focus. In addition, it includes a
number of applications in different fields of psychology that help to illustrate the issues 
brought forward in the earlier chapters, and bring the between-person versus within- 
person debate to life within specific contexts of psychological research. 

This chapter presents further reasoning for taking an alternative research approach to 
the study of processes that unfold within individuals over time as part of their daily lives. 
To this end I focus on three issues. First, I present a brief historical account that shows 
the large-sample approach is not necessarily the only appropriate research approach in 
psychology. Second, I discuss the limitations of this approach, specifically, if our interest 
is in studying psychological processes that take place within individuals. Finally, I discuss 
several alternatives to the standard large-sample approach that allow us to take a closer 
and more detailed look at the processes as they are occurring in daily life. 


A Brief Historical Account 
of the Large-Sample Approach in Psychology 


The foundation of the first psychological laboratory by Wilhelm Wundt in 1879 in Leipzig 
is often identified as the beginning of psychology as an empirical science. However, Dan- 
ziger (1990) identifies three contemporary movements that also had a major impact on 
establishing psychology as a separate science. While these three movements coexisted 
for some time and each developed its own research methods, in the end the large-sample 
approach associated with one of these movements became accepted as the most appropri- 
ate way of obtaining psychologically relevant scientific knowledge. In this section I follow 
Danziger’s discussion of the three historical movements and how current psychological 
research practice is related to them. 

Wundt, who was trained as an experimental physiologist, was the first to apply 
experimental methods from physiology to philosophical questions. In experimental phys- 
iology, normal individuals were considered to have the same physiology. Wundt applied 
a similar view to mental processes through considering individual persons as instances 
of certain common human characteristics. His experiments focused on replicating the 
results found in one individual using other individuals, while controlling for variables 
that caused interindividual differences. If attempted replication failed, introspection was 
used to identify possible causes of interindividual differences (Thorne & Henley, 1997). 
While his approach can be characterized as a single-subject approach, note that it was 
nomothetic rather than idiographic: Not the individual’s idiosyncrasies, but the universal 
principles that characterize normal minds were considered relevant. 

An alternative experimental psychology was initiated around the same time by
Jean-Martin Charcot. In 1862 he became head of the Salpêtrière Hospital in Paris, where
hysterical patients were treated (Boring, 1929). At first, Charcot thought that hysteria was
a physiological disorder, but he became convinced of the psychological basis after he
was confronted with a patient suffering from hypnotically induced hysterical symptoms 
(Thorne & Henley, 1997). In 1878, using hysterical patients, he began to demonstrate the 
effects of hypnosis before audiences of students and colleagues (Boring, 1929). This exper- 
imental form of psychology is often identified as the beginning of clinical psychology. 

A third contemporary movement was initiated by Francis Galton, who was inspired 
by the theory of evolution of his cousin Charles Robert Darwin and applied the idea of 
natural selection to human characteristics (Cowles, 1989; Desrosières, 1998). In order to
study the importance of heredity of such characteristics, Galton established his anthro- 
pometric laboratory at the International Health Exhibition in London in 1884, where he 
measured the mental faculties as well as the physical appearances of over 9,000 visitors 
(Cowles, 1989; Desrosières, 1998). His student Karl Pearson developed the
product-moment correlation coefficient, which offered the opportunity to investigate the
relationship between two variables without manipulating either one. As such, it facilitated the
formulation of a psychology that was scientific, though not experimental. Galton and 
Pearson also collaborated with several others in founding Biometrika, a journal started 
in 1901 and intended to disseminate articles on statistical tools to study heredity and 
natural selection (Cowles, 1989; Desrosières, 1998). The editorial in the first issue
of Biometrika states that the unit of inquiry “is not the individual, but
a race, or a statistically representative sample of a race” (Weldon, Pearson, & Davenport, 
1901, p. 2). Hence, we can state that the psychology stemming from the work by Galton 
and Pearson is a psychology focused on interindividual or between-person differences. 
This kind of research has sometimes been referred to as variable oriented (e.g., Magnus- 
son, 1998), as it focuses on the distribution of variables in the population rather than on 
individual persons. 

The coexistence of these three movements at the beginning of psychology shows that 
there is no a priori subject of psychology: What is considered the appropriate subject of 
scientific psychology and how psychologically relevant scientific knowledge is obtained 
is a matter of consensus, depending on time, place, and culture. Danziger (1990) shows 
that while studies based on the single-subject approach were common in psychological 
scientific journals at the end of the 19th century, the large-sample approach superseded 
the single-subject approach in the first decades of the 20th century. This development 
continued until psychological research was almost exclusively based on the large-sample 
approach as we know it today. The reason for this triumph of the population (Danziger, 
1990) has been sought in the practical applicability of this form of psychology to real-life 
problems (see Thorne & Henley, 1997). Specifically, the successful use of large-sample 
psychology for selection of people for the army during World War I, and of pupils in 
the differentiating educational system at the beginning of the 20th century, eventually 
resulted in the widespread belief that the large-sample approach is the only appropriate 
method to obtain psychologically relevant scientific knowledge. 


Limitations of the Large-Sample Approach 


Psychology is concerned with both trait-like phenomena and processes. While defining 
traits as individual features that do not change over time and situations—at least not in 
the short run—is rather strict, it is a useful definition when it is pitted against processes 
(cf. Hamaker, Molenaar, & Nesselroade, 2007). A process implies some kind of within-person
variability and is thus concerned with states rather than traits. Interestingly, this
trait-process distinction parallels the distinction between correlational research and 
experimental research to some extent (Borsboom, Kievit, Cervone, & Hood, 2009): 
While correlational research was historically concerned with trait-like phenomena in 
which within-person variability is considered a nuisance, experimental research is—by 
definition—concerned with processes as it focuses on the effect of an independent vari- 
able on a dependent variable, implying a change at the individual level. 

The value of experimental research in establishing the existence of causal relationships
is undisputed. However, the results from experimental research are also considered
by many as limited for our understanding of real life, because the translation from the 
laboratory to natural settings is not always straightforward (see also Chapter 1 by Reis, 
this volume). Moreover, comparing means across conditions—a typical approach in 
experimental research—provides a rather crude and shallow picture of the actual under- 
lying process. For these reasons psychologists have sought alternative approaches that 
allow them to gain more insight into processes as they occur in real life. Historically, 
the most common way of studying naturally occurring processes was to study differ- 
ences between people using a large-sample approach, typically only cross-sectionally, but 
sometimes also longitudinally. This kind of process research is based on the assumption 
that results at the population level somehow reflect within-person processes, such that 
generalizations from the population to the individual are meaningful. 

In this section I begin by discussing, intuitively rather than in a formal statistical
way, when the results obtained with the large-sample approach can be generalized to the
individual. Because the conditions needed for such a generalization to be legitimate are
very unlikely to hold, I then turn to more likely scenarios and show how the results at
the within-person and between-person levels contribute to the results obtained in
standard large-sample research. Finally, I discuss how these issues are related to the
existence of general laws.


Ergodicity 


When standard large-sample research is used to study within-person processes, this is 
based on the assumption that the distribution of variables in the population reflects the 
distribution within an individual, where the latter is associated with the psychological 
process (Epstein, 1980; Lamiell, 1990; Magnusson, 1998; Nesselroade, 1991). This 
assumption was phrased explicitly by McCrae and John (1992): “Personality processes, 
by definition, involve some change in thoughts, feeling and actions of an individual; all 
these intra-individual changes seem to be mirrored by interindividual differences in char- 
acteristic ways of thinking, feeling and acting” (p. 199). The fundamental question here is 
whether this assumption is tenable; that is, is it possible to generalize from the population 
level to the within-person level? 

At the beginning of this chapter we have already seen an example clearly illustrating
that this generalization is not always valid. Molenaar (2004) has pointed out that the
generalization from the population to the individual is appropriate only when certain specific
mathematical-statistical conditions, which are known as ergodicity (see also Molenaar 
& Campbell, 2009), are met. In short, applying the concept of ergodicity to psychologi- 
cal phenomena implies that all population moments (e.g., means, variances, covariances) 
must be identical to the corresponding within-person moments. The consequences of this
are far-reaching. For one, it means that structural changes over time must be absent. This 
implies that all developmental processes are by definition nonergodic. Another conse- 
quence is that all within-person moments must be identical across individuals. To illus- 
trate the impact—and untenability—of this consequence, let us consider four moments: 
the mean, the variance, the concurrent covariance, and the lagged covariance. 

First, ergodicity implies that all individuals have the same average over time (which 
in turn is identical to the cross-sectional population mean). Most psychologists dismiss 
this as an unlikely scenario: In fact, we expect people to differ with respect to their aver- 
age level of, for instance, positive affect, reflecting trait-like differences between individu- 
als (cf. Fleeson, 2001; Hamaker et al., 2007). 

Second, ergodicity implies that every person is characterized by the same amount 
of within-person variance (which in turn is identical to the amount of between-person 
difference when taking a cross-sectional perspective). Again, many psychologists dismiss 
this idea, and empirical studies show that people may vary with respect to their within- 
person variability (cf. Epstein, 1980; Fleeson, 2001; Kernis, 2005). 

Third, ergodicity implies that the covariance between two variables is the same 
across individuals (and in turn is identical to the covariance observed across people). 
While the limiting quality of this consequence may be less obvious at first sight, a long 
line of research focuses specifically on the possibility that within-person covariances may 
differ across people. Specifically, applications of Cattell’s P-technique (see also the chap- 
ter by Brose & Ram, Chapter 25, this volume), have shown that relationships between 
variables may differ from one individual to the next, resulting in different factor struc- 
tures. 

Finally, ergodicity also implies that all individuals are characterized by the same 
lagged covariances (and that these are in turn the same as the relationship found across 
people when considering the same lag). A lagged relationship is the relationship between 
two variables at a certain distance in time (i.e., the lag). It may concern the same vari- 
able—such as when we are looking at the relationship between today’s negative affect 
and yesterday’s negative affect—or two different variables, for instance, when we look at 
today’s negative affect and yesterday’s positive affect. Although, of the four phenomena 
discussed earlier, this source of individual differences has probably been studied the least, 
there is a growing body of evidence for individual differences being the norm rather than 
the exception when considering these moments (e.g., Kuppens, Oravecz, & Tuerlinckx, 
2010; Suls, Green, & Hillis, 1998; Suls & Martin, 2005; Wenze, Gunthert, Forand, & 
Laurenceau, 2009). 

From this it is clear that the constraints needed for ergodicity are very unlikely 
to hold in psychological practice. Hence, I feel it is safe to say that most—if not all— 
psychological phenomena are nonergodic. The consequence of this is that we cannot 
blindly base statements about processes that take place within people on results obtained 
with standard large-sample analyses. 


Within-Person Variability, Between-Person Differences, 
and Population Results 


To further deepen our understanding of the virtues and limitations of the standard large- 
sample approach, it is useful to take a more detailed look at the distinct sources of vari- 
ance that are likely to contribute to our observations. We can think of a measurement 
taken at a particular occasion from a particular person as the sum of two components,
that is:


$x_{it} = \mu_i + s_{it}$ (1)


where $\mu_i$ represents the person’s mean over time, which we may thus refer to as the person’s
trait score, and $s_{it}$ is the person’s temporal deviation from this mean, which we may refer to
as the person’s state at occasion $t$ (Hamaker et al., 2007). This rather basic model can also
be recognized as part of a simple multilevel or random effects model for longitudinal data:
It can form the expression at the first or within-person level (e.g., $s_{it} \sim N(0, \omega)$), while $\mu_i$
is further modeled at the second or between-person level (e.g., $\mu_i \sim N(\mu, \theta)$).
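
The decomposition in Equation 1 can be made concrete with a small simulation. The sketch below is illustrative only: the sample sizes, the normal distributions, and the variance values (a between-person variance of 4 and a within-person variance of 1) are assumptions chosen for the example, not values from the chapter. It shows that a single cross section carries both sources of variance at once.

```python
import numpy as np

def simulate_trait_state(n_persons=2000, n_occasions=200,
                         grand_mean=5.0, between_var=4.0, within_var=1.0,
                         seed=1):
    """Simulate x[i, t] = mu[i] + s[i, t]; all numeric values are illustrative."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(grand_mean, np.sqrt(between_var), size=n_persons)        # trait scores
    s = rng.normal(0.0, np.sqrt(within_var), size=(n_persons, n_occasions))  # states
    return mu[:, None] + s

x = simulate_trait_state()
person_means = x.mean(axis=1)                  # estimates of each person's trait score
est_within = x.var(axis=1, ddof=1).mean()      # average within-person variance
est_between = person_means.var(ddof=1)         # variance of the person means
est_cross = x[:, 0].var(ddof=1)                # one occasion, all persons

print(f"within-person variance   : {est_within:.2f}")
print(f"between-person variance  : {est_between:.2f}")
print(f"cross-sectional variance : {est_cross:.2f}")
```

In this setup the cross-sectional variance should come out close to the sum of the other two, which is the sense in which cross-sectional results mix trait-like and state-like variation.
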

Despite its simple appearance, however, the implications from Equation 1 are far- 
reaching: It shows that when we take a cross section and analyze these data, both within- 
person (or state-like) aspects and between-person (or trait-like) aspects influence our 
results. The degree to which our cross-sectional results represent the within-person or 
the between-person reality depends on the relative contributions that these two sources 
make. To elaborate on this, Figure 3.2 contains four scenarios in which the focus is on 
the relationship between two variables. 

In the first scenario, both the correlation and the variances at the within-person level 
are identical to those at the between-person level. In this case, the cross-sectional result 
will be characterized by this same correlation. In the second scenario, the variances are 
the same at the two levels, but the correlation at the between-person level is now zero. 
In this case the cross-sectional result is characterized by a correlation that is the average 
of the two correlations. In the third scenario, the variances are again equal, but now the 
correlations at the two levels are opposite. This results in a zero correlation for the cross- 
sectional approach. Finally, like the third, the fourth scenario is characterized by opposite 
correlations at the two levels, but in addition the variances at the between-person level 
are four times larger than the variances at the within-person level. The cross-sectional 
result is now dominated by the between-person structure, although the negative within- 
person correlation leads to a lower correlation for the cross-sectional approach than the 
correlation at the between-person level. 

These examples show that the variances and the correlations obtained within the 
cross-sectional approach are influenced by the variances and correlations at both the 
within-person and the between-person level. The cross-sectional variance of a variable 
is simply the sum of the variances at both levels. The cross-sectional correlation is a 
weighted sum of the within-person and the between-person correlations, where the rela- 
tive contributions depend on the relative variances at these two levels. As a result, the 
cross-sectional relationship may more closely resemble the within-person relationship 
or the between-person relationship. It is important to note that these issues apply not 
only to concurrent relationships in standard cross-sectional research, but also to lagged
relationships in longitudinal research (whether it concerns the same variable or different 
variables measured at different occasions). 
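
The weighted-sum result can be written out explicitly under one additional assumption, namely that the trait and state components are independent of each other, so that covariances at the two levels simply add up. The function below is a minimal sketch under that assumption. The correlation of 0.6 is hypothetical, because the values underlying the four scenarios are not reported in the text.

```python
import math

def cross_sectional_corr(r_within, r_between,
                         var_within=(1.0, 1.0), var_between=(1.0, 1.0)):
    """Cross-sectional correlation of x and y when trait and state parts are independent.

    var_within and var_between hold the (x, y) variances at each level.
    """
    cov_w = r_within * math.sqrt(var_within[0] * var_within[1])
    cov_b = r_between * math.sqrt(var_between[0] * var_between[1])
    total_x = var_within[0] + var_between[0]
    total_y = var_within[1] + var_between[1]
    return (cov_w + cov_b) / math.sqrt(total_x * total_y)

r = 0.6  # hypothetical correlation, not taken from the chapter
scenarios = {
    1: cross_sectional_corr(r, r),                              # same at both levels
    2: cross_sectional_corr(r, 0.0),                            # zero between-person correlation
    3: cross_sectional_corr(-r, r),                             # opposite correlations
    4: cross_sectional_corr(-r, r, var_between=(4.0, 4.0)),     # opposite, between variance x4
}
for number, value in scenarios.items():
    print(f"scenario {number}: cross-sectional correlation = {value:.2f}")
```

With equal variances at both levels the result is the plain average of the two correlations. In the fourth scenario the between-person correlation receives four times the weight, so the cross-sectional value is pulled toward it without reaching it.
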

None of the four sketched scenarios represents ergodicity: The latter would imply 
there are no mean differences between individuals, such that there would be no variance 
at the between-person level, and consequently, the cross-sectional results would coincide 
with the within-person results. The scenarios discussed earlier show that there may be 
specific nonergodic situations in which at least some results found at the cross-sectional 
level are identical to the corresponding within-person results. 

FIGURE 3.2. Four scenarios of relationships between two variables at the within-person and the
between-person level, and the resulting cross-sectional relationship. In the first three scenarios
within-person variability and between-person variability are equally large, such that both
structures contribute equally to the cross-sectional result. In the fourth scenario, the
between-person variance on both variables is four times larger than that at the within-person
level, such that the cross-sectional result is dominated (but not determined) by the
between-person relationship.


General Laws


When ergodicity does not hold, such that the within-person results are not mirrored by 
the results obtained from standard cross-sectional or longitudinal research, general laws 
that apply to each and every individual in the population may still exist. To illustrate this,
consider the hypothetical examples in Figure 3.3. In the left panel, the lagged relation- 
ship between today’s negative affect and yesterday’s negative affect is shown, both for the 
within-person level and the between-person level. It shows that everybody is character- 
ized by the same within-person relationship, which could thus be considered a general 
law. It also shows that the correlation between the lagged affect scores is the same across 
levels, a feature that has been referred to as homology across levels (Hamaker et al., 
2005; Hannan, 1971). However, ergodicity is clearly absent: Not all individuals have the 
same mean, and the amount of within-person variability is much smaller than the vari- 
ability found cross-sectionally. Hence, in this scenario, there is no ergodicity, but there is 
a general law, and the standardized within-person results are the same as the standard- 
ized results obtained with standard large-sample research. 

In the middle panel, the relationship between positive and negative affect is depicted. 
It shows that while all individuals are characterized by the same negative within-person 
relationship between these affective states, the relationship at the between-person level is 
absent. Clearly, in this case, there is no ergodicity, nor is there homology. However, even 
in this case there is still a general law that applies to all individuals, similar to the example 
given at the beginning of this chapter. 

Finally, the right panel of Figure 3.3 contains a situation in which individuals differ 
from each other with respect to their within-person relationship between two variables, 
here sociability and shyness. While at the between-person level we find that people who 
score high on shyness tend to be less sociable than those scoring low on shyness, only 
some individuals are characterized by the same negative relationship at the within-person 
level. Other individuals are characterized by either an independence between these states 
or a positive relationship. The latter would imply that when the individual’s need for 
sociability increases, his or her shyness also increases, and vice versa. Hence, there is no 
ergodicity and no homology, nor is there a general law. Still, it could be interesting to 
investigate this situation, especially when there are possible third variables that moderate
the within-person relationship. For instance, Suls and Martin (2005) found that the
relationship between mood and problems was moderated by neuroticism, in that individuals
high on neuroticism were characterized by a stronger relationship between mood and
problems than were individuals low on neuroticism.

FIGURE 3.3. Examples of possible within-person and between-person relationships between
variables. Panels, left to right: negative affect yesterday and negative affect today;
negative affect and positive affect; shyness and sociability.

It should also be noted that in many studies that focus on the possible differences 
between within-person and between-person relationships of variables within a multilevel 
approach, it is implicitly assumed that there is a general law; that is, when individual dif- 
ferences at the within-person level are considered, this is often restricted to differences in 
means, while the relationships between variables (or even the variances) are held constant 
across people. The resulting description of the within-person process may well deviate 
from what is true in the aggregate, that is, when just averaging across individuals using 
the traditional large-sample approach: The latter mixes both the average within-person 
results and the between-person variation in within-person means (see Figure 3.2), while, 
in contrast, the averaged within-person results represent an average that is not influenced 
by between-person differences in within-person means. In other words, it provides a 
description of the average within-person process. 

The advantage of modeling only between-person differences in means and assuming 
that all other within-person moments are identical across individuals is that it is compu- 
tationally faster and can be easily done in standard multilevel software packages. How- 
ever, it lacks the detail and nuance that could be obtained when considering individual 
differences with respect to other within-person features, such as variances or (lagged) 
covariances (cf. Hoffman, 2007). If we allow for individual differences in within-person 
relationships, we may find that these relationships are actually identical across individ- 
uals, indicating there is some general law. Zautra, Affleck, Tennen, Reich, and Davis 
(2005), for instance, found that while there were individual differences in the relation- 
ship between negative affect and negative events, the relationship between negative affect 
and positive events showed no individual differences, implying a general law that applies 
to everyone. It is also possible that even though the average within-person relationship 
is zero, there are individual differences with respect to this relationship. For instance, 
Wenze and colleagues (2009) found that while the average effect of anger on subsequent 
depression was not significant, this relationship was significantly moderated by dyspho- 
ria, in that people high on initial dysphoria were characterized by a positive relationship 
from anger to depressed mood, while for those low on initial dysphoria, anger actually 
decreased depressed mood. 

In short, to determine whether everybody is characterized by the same within-person 
relationship, and if not, what predicts the differences, we must include the possibility of 
between-person differences in within-person relationships in our modeling strategy. If 
there are no individual differences, we can conclude there is evidence for a general law, 
and when there are individual differences that cannot be ignored, we can further investi- 
gate to see what predicts them or what they predict. 
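
As a rough illustration of what allowing for such differences involves, the sketch below estimates a separate regression slope for every person and compares the spread of these slopes with the slope from a single cross section. This is a crude person-by-person stand-in for a multilevel model with random slopes, not the procedure used in the studies cited above, and every numeric value in it is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_persons, n_occasions = 200, 100

# Hypothetical data-generating values: person-specific means and slopes.
mean_x = rng.normal(0.0, 2.0, size=n_persons)
mean_y = 0.8 * mean_x + rng.normal(0.0, 1.0, size=n_persons)  # positive between-person link
slopes = rng.normal(-0.5, 0.3, size=n_persons)                # within-person slopes differ

x_state = rng.normal(size=(n_persons, n_occasions))
x = mean_x[:, None] + x_state
y = mean_y[:, None] + slopes[:, None] * x_state + rng.normal(0.0, 0.5, size=x.shape)

# Person-by-person least-squares slopes of y on x (each person centered on own mean).
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
slope_hat = (xc * yc).sum(axis=1) / (xc**2).sum(axis=1)

# Slope from a single cross section: first occasion, all persons.
pooled_slope = np.polyfit(x[:, 0], y[:, 0], deg=1)[0]

print(f"mean within-person slope : {slope_hat.mean():+.2f}")
print(f"SD of within-person slopes: {slope_hat.std(ddof=1):.2f}")
print(f"cross-sectional slope     : {pooled_slope:+.2f}")
```

A standard deviation of the person-specific slopes that is clearly larger than sampling error alone would produce argues against a general law; a spread close to zero would be consistent with one.
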


Alternatives to the Large-Sample Approach 


Acknowledging that cross-sectional and other standard large-sample approaches are lim- 
ited in their capability to inform us on processes that take place at the within-person level 
has led researchers to search for alternative approaches. With respect to developmental
processes this has resulted in the development of random effects models, known as
(latent) growth curve models, which allow for individual differences in the parameters 
that describe the developmental trajectories (cf. Bollen & Curran, 2006; Meredith & 
Tisak, 1984, 1990). Recently, valuable extensions of these models have been proposed 
that allow for more flexible trajectories. For instance, Wang and McArdle (2008) com- 
bined the latent growth curve model with a random effects turning point (or change point) 
model, to allow for a sudden change in the developmental trajectory. Dolan, Schmittman, 
Lubke, and Neale (2005) combined the latent growth curve model with a hidden Markov 
model instead, such that individuals may switch from one subgroup to another, while 
each subgroup is characterized by its own developmental process that is modeled as a 
latent growth curve model. 

For stable or stationary processes, which are characterized by within-person vari- 
ability in the absence of structural (i.e., developmental) changes (Nesselroade, 1991), 
the modeling possibilities have developed more slowly. This may be partly attributed 
to the fact that many researchers were not convinced that the cross-sectional approach 
and longitudinal large-sample approaches are inappropriate for the study of these pro- 
cesses. However, with the growing awareness about the strong assumptions on which 
such research is based (i.e., ergodicity), and the possible differences between the within- 
person and between-person results obtained in multilevel research, there is also a grow- 
ing interest in alternative modeling strategies for stable processes. In this section I discuss 
two major approaches: the single-subject approach based on time series analysis, which 
has a relatively long history in psychological research, and the multilevel approach based 
on time series models, which is more novel. 


Single-Subject Approach: Time Series Analysis 


Time series analysis is specifically concerned with the relationships of one or more vari- 
ables to themselves and each other over time within a single case. It is a class of techniques 
that are frequently used not only in econometrics but also in disciplines such as hydrol- 
ogy, engineering, biology, and—to some extent—psychology. The latter applications are 
based on viewing the individual as a stochastic system that varies over time according to 
specific probabilistic laws (Chassan, 1959). In this view, time series analysis is a technique 
that allows one to uncover the structure of within-person variability of a particular indi- 
vidual. While this makes time series analysis a powerful tool for the study of psychologi- 
cal processes, relatively few studies in psychology are based on this approach. Below is a 
primer on time series analysis in psychological research. 

Time series data are often characterized by autocorrelation, which implies that con- 
secutive observations are not independent of each other. A rather simple way of modeling 
such temporal dependency is by use of a first-order autoregressive (AR[1]) model. An 
AR(1) process is expressed as 


$y_t = c + \phi y_{t-1} + u_t$ (2)


where the autoregressive parameter $\phi$ captures the temporal dependency, $c$ is a constant
that is a function of the individual’s mean and autoregressive parameter (i.e., $c = (1 - \phi)\mu$),
and $u_t$ is the unpredictable part, which is referred to as the innovation, random shock,
or residual.
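
Equation 2 can be simulated directly, which is a useful way to get a feel for the model. The sketch below assumes normally distributed innovations and uses hypothetical values (an autoregressive parameter of 0.5 and a person mean of 3). The autoregressive parameter and the constant are then recovered by regressing each observation on the previous one.

```python
import numpy as np

def simulate_ar1(n, phi, mu=0.0, innovation_sd=1.0, seed=0):
    """Simulate y[t] = c + phi * y[t-1] + u[t], with c = (1 - phi) * mu."""
    rng = np.random.default_rng(seed)
    c = (1.0 - phi) * mu
    u = rng.normal(0.0, innovation_sd, size=n)
    y = np.empty(n)
    # Start from a draw from the stationary distribution (requires |phi| < 1).
    y[0] = mu + u[0] / np.sqrt(1.0 - phi**2)
    for t in range(1, n):
        y[t] = c + phi * y[t - 1] + u[t]
    return y

def fit_ar1(y):
    """Least-squares estimates of (c, phi) from regressing y[t] on y[t-1]."""
    phi_hat, c_hat = np.polyfit(y[:-1], y[1:], deg=1)  # polyfit returns slope first
    return c_hat, phi_hat

y = simulate_ar1(n=5000, phi=0.5, mu=3.0)   # hypothetical parameter values
c_hat, phi_hat = fit_ar1(y)
mu_hat = c_hat / (1.0 - phi_hat)            # person mean implied by c and phi

print(f"estimated phi : {phi_hat:.2f}")
print(f"estimated mean: {mu_hat:.2f}")
```

Least squares is only one of several estimators for this model; it is used here because it makes the regression interpretation of the autoregressive parameter explicit.
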




The autoregressive parameter reflects the degree to which the current observation $y_t$
can be predicted from the previous observation $y_{t-1}$. It is identical to the lag 1
autocorrelation, that is, the correlation between $y_t$ and $y_{t-1}$ (Hamilton, 1994), and, as a
result, it does not change when the data are standardized. While it necessarily falls within the
range of -1 and 1, in most psychological applications it lies between 0 and 1. It has been
interpreted as a measure of inertia (Gottman, Murray, Swanson, Tyson, & Swanson, 
2002; Kuppens, Allen, & Sheeber, 2010), reflecting a person’s reluctance to change and 
a tendency to ruminate, or as a measure of carryover or spillover from one measurement 
to the next (Suls & Martin, 2005). Another way of thinking about the autoregressive 
parameter is that it is inversely related to regulatory strength; that is, if an individual is 
characterized by a strong regulatory mechanism, he or she will return to his or her base- 
line quickly after being perturbed, and this is reflected by an autoregressive parameter 
close to zero; in contrast, an individual who is characterized by a weak regulatory mecha- 
nism will recuperate less quickly, and it will take him or her longer to return to his or her 
baseline, which is reflected by an autoregressive parameter closer to 1. 
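
The regulatory interpretation follows from applying the AR(1) equation repeatedly with future innovations set to their expected value of zero: after k occasions, the expected deviation from baseline is the initial deviation multiplied by the autoregressive parameter raised to the power k. The helper functions below are a small sketch of that calculation; their names and the example values are hypothetical.

```python
import math

def expected_deviation(initial_deviation, phi, steps):
    """Expected distance from baseline `steps` occasions after a perturbation."""
    return initial_deviation * phi**steps

def occasions_to_recover(phi, fraction=0.5):
    """Occasions needed for the expected deviation to shrink to `fraction` of its start.

    Only meaningful for 0 < phi < 1 and 0 < fraction < 1.
    """
    return math.log(fraction) / math.log(phi)

for phi in (0.2, 0.5, 0.8):  # hypothetical autoregressive parameters
    path = [round(expected_deviation(1.0, phi, k), 3) for k in range(5)]
    print(f"phi={phi}: deviations {path}, "
          f"half recovered after {occasions_to_recover(phi):.2f} occasions")
```

A parameter close to zero gives an almost immediate return to baseline, whereas a parameter close to 1 stretches the recovery over many occasions.
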

The innovation $u_t$ is unrelated to prior observations and is thus the unpredictable part
of the current observation. It can be conceived of as everything that took place between
occasion $t - 1$ and occasion $t$ that influences the process under investigation. This implies
that it is a (weighted) sum of (infinitely many) internal and external events, which are 
somehow relevant for this particular process in a particular person. Such events may 
include conversations the person participated in, for instance, with colleagues, peers, or 
family members; the person’s alcohol or coffee consumption; information the person was 
confronted with through books, media, and e-mails; and the appraisal, thoughts, and 
memories experienced by the person. All of these events—some of which will have a posi- 
tive influence, while others will have a negative influence on the process—are collectively 
represented in the innovation. Note that in using this model we do not have to further 
define or specify what is included in the innovation. There is, however, an important and
well-known relationship between the variance of the innovations, $\sigma_u^2$, the variance of the
observed series, $\sigma_y^2$, and the autoregressive parameter (Hamilton, 1994), that is:


$\sigma_y^2 = \frac{\sigma_u^2}{1 - \phi^2}$ (3)

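
The relationship in Equation 3 says that the variance of the observed series equals the innovation variance divided by one minus the squared autoregressive parameter. It can be checked numerically; the sketch below does so with hypothetical values (an autoregressive parameter of 0.7 and an innovation variance of 1) and assumes normally distributed innovations.

```python
import numpy as np

def stationary_variance(phi, innovation_var):
    """Variance of a stationary AR(1) series implied by phi and the innovation variance."""
    if abs(phi) >= 1:
        raise ValueError("the AR(1) process is stationary only for |phi| < 1")
    return innovation_var / (1.0 - phi**2)

# Numerical check with hypothetical values.
rng = np.random.default_rng(42)
phi, n = 0.7, 200_000
u = rng.normal(0.0, 1.0, size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + u[t]

print(f"implied variance  : {stationary_variance(phi, 1.0):.3f}")
print(f"simulated variance: {y[1000:].var():.3f}")  # first 1000 values dropped as burn-in
```

The same innovation variance therefore produces a much more variable series when the autoregressive parameter is large.
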
The AR(1) process described is one of the most basic models known in time series 
analysis, and it has many extensions. For instance, for a multivariate rather than a uni- 
variate time series, we can use a vector AR(1) process in which a vector with observations 
at occasion $t$ is regressed on the vector with preceding observations. The matrix with
regression parameters contains the autoregressive parameters on the diagonal, by which 
the variables are regressed upon themselves, and the cross-lagged regression coefficients 
as the off-diagonal elements, by which the variables are regressed on each other at the 
previous occasion. This model is particularly appealing because it allows for investigat- 
ing whether there is evidence for causal relationships in correlational data (i.e., Granger 
causality; Granger, 1969). An illustrative application can be found in Schmitz and Skin- 
ner (1993), who used this model to investigate the lagged relationships between perceived 
control, effort, and performance in five children, and compared these results to the
cross-sectional results.
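
The structure of the vector AR(1) model can be sketched for two variables as follows. The parameter matrix, the zero means, and the standard normal innovations are assumptions made for the example; the matrix is estimated here by ordinary least squares, which is one common choice rather than the method of any particular study.

```python
import numpy as np

def simulate_var1(n, Phi, seed=0):
    """Simulate y[t] = Phi @ y[t-1] + u[t] with standard normal innovations."""
    rng = np.random.default_rng(seed)
    k = Phi.shape[0]
    y = np.zeros((n, k))
    for t in range(1, n):
        y[t] = Phi @ y[t - 1] + rng.normal(size=k)
    return y

def fit_var1(y):
    """Least-squares estimate of Phi, regressing y[t] on y[t-1] (no intercept)."""
    X, Y = y[:-1], y[1:]
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves X @ B ~= Y, so B is Phi transposed
    return B.T                                 # row i holds the predictors of variable i

# Hypothetical matrix: autoregressive parameters on the diagonal, cross-lagged
# coefficients off the diagonal (variable 2 predicts variable 1, but not vice versa).
Phi_true = np.array([[0.5, 0.3],
                     [0.0, 0.4]])
y = simulate_var1(20_000, Phi_true)
Phi_hat = fit_var1(y)
print(np.round(Phi_hat, 2))
```

A cross-lagged estimate that is clearly nonzero in one direction only is the kind of pattern that motivates a Granger-causality interpretation, although a formal test is not shown here.
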

Instead of modeling the autoregressive and cross-lagged regressive relationships
between the observed variables directly, one may also assume that several variables
measure an underlying construct (e.g., anxiety or happiness). In that case, the autoregressive
model can be combined with a common factor model, such that the lagged relationships 
are modeled at the latent level. Examples of this can be found in Ferrer and Nesselroade 
(2003), who used this model to investigate the way in which the daily mood of each per- 
son in a married couple was influenced by his or her partner’s mood the preceding day, 
and in Hamaker and colleagues (2005), who used this model to investigate the equiva- 
lence in factor structure across individuals. 

A model that has been considered less frequently in psychology, but which may have 
a lot of potential for studying fluctuations associated with daily life, is the autoregressive 
integrated moving average (ARIMA[0,1,1]) model. Peterson and Leckman (1998) used 
this model to investigate the temporal structure in inter-tic intervals in 22 patients with 
Tourette syndrome. Fortes, Delignières, and Ninot (2004) used this model for studying
the fluctuations in self-esteem and physical self in four adults. They indicated that an 
ARIMA(0,1,1) can be thought of as a homeostatic regulation process, in which the sys- 
tem is more or less likely to return to baseline after being perturbed: If the system returns 
to baseline quickly, this is referred to as preservation, but if it takes a system a long time 
to wash out the effect of a perturbation, this is referred to as adaption. 
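
To make the ARIMA(0,1,1) structure concrete: the change from one occasion to the next is modeled as the current innovation minus a fraction of the previous one. The sketch below assumes the sign convention in which that fraction, called theta here, is subtracted; software packages differ on this convention, and the value 0.6 is hypothetical. Under this convention the occasion-to-occasion changes have a lag 1 autocorrelation of -theta / (1 + theta^2), which the simulation reproduces.

```python
import numpy as np

def simulate_arima011(n, theta, seed=0):
    """Simulate y[t] = y[t-1] + u[t] - theta * u[t-1] (sign convention is an assumption)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n + 1)
    differences = u[1:] - theta * u[:-1]   # moving-average structure of the changes
    return np.cumsum(differences)

def lag1_autocorr(x):
    """Sample lag 1 autocorrelation."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

theta = 0.6  # hypothetical value
y = simulate_arima011(50_000, theta)
changes = np.diff(y)
print(f"lag 1 autocorrelation of changes: {lag1_autocorr(changes):.3f}")
print(f"theoretical value               : {-theta / (1 + theta**2):.3f}")
```

With theta close to 1, most of each innovation is cancelled at the next occasion; with theta close to 0, the series behaves like a random walk in which perturbations persist.
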

Other advanced time series models that can be useful for studying psychological 
processes are based on regime-switching. Such models are characterized by two (or more) 
equilibria, and over time the system temporarily settles around one of these, while occa- 
sionally switching to another equilibrium. Hence, these equilibria can be thought of as 
the attractors of the system. Warren and his colleagues used such models to investigate the 
usefulness of abstinence versus controlled drinking in a single individual diagnosed with 
alcoholism (Warren, Hawkins, & Sprott, 2003), to model the switches between recovery 
and relapse in seven developmentally disabled sex offenders (Warren, 2002), and to inves- 
tigate the nonlinearity in fluctuations of daily highs in anxiety and ruminations in a male 
being treated for anxiety disorder (Warren, Sprott, & Hawkins, 2002). Hamaker, Gras- 
man, and Kamphuis (2010) compared a number of different regime-switching models to 
investigate episodes of mania and depression in a patient diagnosed with rapid cycling 
bipolar disorder. 
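
The idea of a system that settles around one of several equilibria can be illustrated with one simple member of this family, a two-regime Markov-switching mean model. It is not necessarily the model used in the studies cited above, and the regime means, the probability of staying in a regime, and the noise level are all hypothetical.

```python
import numpy as np

def simulate_markov_switching(n, means=(-2.0, 2.0), stay_prob=0.95,
                              noise_sd=1.0, seed=0):
    """Simulate a series that fluctuates around one of two equilibria.

    At each occasion the system stays in its current regime with
    probability stay_prob and otherwise switches to the other regime.
    """
    rng = np.random.default_rng(seed)
    stays = rng.random(n) < stay_prob
    regime = np.zeros(n, dtype=int)
    for t in range(1, n):
        regime[t] = regime[t - 1] if stays[t] else 1 - regime[t - 1]
    y = np.asarray(means)[regime] + rng.normal(0.0, noise_sd, size=n)
    return regime, y

regime, y = simulate_markov_switching(20_000)
switch_rate = np.mean(regime[1:] != regime[:-1])
print(f"proportion of occasions with a switch: {switch_rate:.3f}")
for r in (0, 1):
    print(f"mean level in regime {r}: {y[regime == r].mean():+.2f}")
```

In empirical work the regime sequence is not observed and has to be inferred from the data, which is what makes these models considerably harder to fit than to simulate.
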

Another model that has been extremely popular in econometrics is the cointegration 
model. While the vector AR(1) model, discussed earlier, is useful for modeling short-term 
dependency of variables on one another, the cointegration model was designed for multi- 
variate series that may seem independent in the short term but exhibit long-term depen- 
dencies (Engle & Granger, 1987). In psychology this model has thus far been applied by 
Stroe-Kunold and Werner (2008) to model the physical and psychological symptoms of a 
couple involved in a long-term relationship. 

In all the examples discussed earlier, a truly single-subject approach was taken, 
whether data were available from only one person or more than one person. In the 
latter case, the results obtained per person were compared afterward in a mainly non- 
statistical manner. However, when there are enough individuals for whom time series 
are obtained, one could also consider the possibility of using a multilevel approach, in 
which Level 1 is formed by the repeated measurements, while Level 2 is formed by the 
individuals. While this is more restrictive than the replicated single-subject approach 
discussed earlier—as we must make the assumption that the individuals come from the 
same population and are (at least qualitatively) characterized by the same process—the 
advantage is that it allows us to investigate the quantitative differences between individuals
and search for possible predictors of these. I discuss below a number of studies
based on this approach. 


Multilevel Models Based on Time Series Models 


The use of multilevel analysis for repeated measures in the absence of developmental tra- 
jectories has focused on between-person differences in within-person means and—to a 
lesser extent—on between-person differences in within-person instability. The latter has 
been of interest because instability has been related to diverse psychological disorders. 
However, Jahng, Wood, and Trull (2008) point out that within-person variance is not 
necessarily a good indicator of instability: If there is a high degree of temporal depen- 
dence (i.e., autocorrelation), the within-person variance may be large, while the moment- 
to-moment instability is actually quite small. Therefore, they propose to use the mean 
squared successive difference (i.e., the average squared change from one occasion to the 
next) as a measure of instability. While this measure has several desirable features—such 
as the fact that it can be used both for data containing a trend and for stationary data—a 
disadvantage is that it does not provide us with a model that describes the actual process 
under investigation (see also Eid, Courvoisier, & Lischetzke, Chapter 21, this volume). 
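
The contrast between within-person variance and the mean squared successive difference (MSSD) can be illustrated with a small simulation. The following is a minimal sketch, not the procedure used by Jahng and colleagues: it assumes a zero-mean AR(1) process scaled to unit variance, and the function names and parameter values are chosen purely for illustration. Under these assumptions the expected MSSD is 2(1 - phi) times the variance, so a highly autocorrelated series should show a much smaller MSSD than an uncorrelated series with roughly the same variance.

```python
import random
import statistics


def mssd(series):
    """Mean squared successive difference: average of (x[t+1] - x[t])**2."""
    diffs = [(b - a) ** 2 for a, b in zip(series, series[1:])]
    return sum(diffs) / len(diffs)


def simulate_ar1(phi, n, seed):
    """Simulate a zero-mean AR(1) series scaled to unit stationary variance."""
    rng = random.Random(seed)
    innovation_sd = (1 - phi ** 2) ** 0.5  # keeps Var(x) = 1 for any |phi| < 1
    x = [rng.gauss(0, 1)]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0, innovation_sd))
    return x


# Illustrative values only: one strongly dependent and one independent series.
inert = simulate_ar1(phi=0.9, n=5000, seed=1)
erratic = simulate_ar1(phi=0.0, n=5000, seed=2)

for name, s in (("inert", inert), ("erratic", erratic)):
    print(name, "variance:", round(statistics.pvariance(s), 2),
          "MSSD:", round(mssd(s), 2))
```

Both series have similar variance by construction, so any difference in the MSSD reflects temporal dependence rather than overall spread.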

A few researchers have used an actual time series model at Level 1 as the point of 
departure for their multilevel model. Note that this differs from specifying an AR(1) for 
the error covariance structure, which is an option included in most multilevel software 
packages: The latter implies that all individuals are characterized by the same autoregres- 
sive parameter and the same innovation variance, while the means or trends may differ 
across individuals. In contrast, taking a time series process such as an AR(1) as the Level 
1 model implies that the focus is specifically on individual differences in temporal depen- 
dency. 

An early example of this can be found in Suls and colleagues (1998), who used a 
multilevel approach to model daily mood ratings and their relation to perceived problems. 
One of their aims was to investigate whether individuals characterized by higher neuroti- 
cism scores have inappropriate or maladaptive coping methods to handle daily problems. 
A second research question was whether these people have a harder time repairing their 
mood, which is conceptualized as emotional inertia. 

At the first level of their model, Suls and colleagues (1998) predicted today’s mood 
from yesterday’s mood (as in an AR(1) model), and from today’s problems. At the sec- 
ond level, both the random autoregressive parameter and the random regression coef- 
ficient for problems were regressed on the individual’s personality characteristics. These 
analyses showed that the autoregressive parameter could be predicted from neuroticism, 
in that people who scored higher on neuroticism had higher autoregressive parameters 
than people who scored lower on neuroticism. This confirmed the researchers’ expecta- 
tions that people who score high on neuroticism have more difficulty with repairing their 
mood and returning to baseline than people who score low on neuroticism. 
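
The structure of such a model can be written out as follows. This is a sketch of the general form described above, not the exact specification reported by Suls and colleagues; the symbols are assumptions introduced here, with $M_{ti}$ the mood of person $i$ on day $t$, $P_{ti}$ that day's problems, and $N_i$ the person's neuroticism score. Leaving the random intercept without a predictor is likewise an assumption made for brevity.

```latex
\begin{align}
\text{Level 1:}\quad & M_{ti} = \beta_{0i} + \phi_i\, M_{t-1,i} + \beta_{1i}\, P_{ti} + e_{ti} \\
\text{Level 2:}\quad & \beta_{0i} = \gamma_{00} + u_{0i} \\
& \phi_i = \gamma_{10} + \gamma_{11} N_i + u_{1i} \\
& \beta_{1i} = \gamma_{20} + \gamma_{21} N_i + u_{2i}
\end{align}
```

In this notation, the finding that neuroticism predicts inertia corresponds to a positive $\gamma_{11}$.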

More recently, Rovine and Walls (2006) used an AR(1) process to model the amount 
of alcohol consumed on a daily basis in a group of moderate to heavy social drinkers. 
They included gender as a time-invariant covariate and desire to drink on the preceding 
day as a time-varying covariate. They concluded that desire to drink had an important 
effect on the amount of alcohol consumed, and that this was even stronger for men than 
for women. Furthermore, the effect of the amount of alcohol consumed the preceding
day was relatively small (autoregressive parameters ranged from about -.2 to .2) and cen- 
tered around zero, indicating different kinds of regularity: While a positive autoregressive 
parameter implies a spillover effect from one day to the next, a negative autoregressive 
parameter results in a pattern that can be interpreted as a kind of compensation behav- 
ior, such that days on which relatively little alcohol is consumed are followed by days on 
which more alcohol is consumed, and vice versa. 
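
The two kinds of regularity follow directly from the AR(1) equation: the expected deviation from a person's own mean t days after a deviation of size d is phi to the power t, times d. The sketch below uses hypothetical values (a deviation of 3 drinks, and autoregressive parameters of .2 and -.2 at the edges of the reported range) to show the difference in sign pattern; it is an illustration, not a reanalysis of the Rovine and Walls data.

```python
def expected_deviations(phi, start, days):
    """Expected deviation from the person's mean on each following day.

    For an AR(1) process, E[y_t | y_0] = phi**t * y_0, where y denotes
    the deviation from the person's own mean.
    """
    return [start * phi ** t for t in range(1, days + 1)]


# Hypothetical day with 3 drinks above the person's average, followed for 3 days.
spillover = expected_deviations(phi=0.2, start=3.0, days=3)
compensation = expected_deviations(phi=-0.2, start=3.0, days=3)

print("phi =  0.2:", [round(v, 3) for v in spillover])
print("phi = -0.2:", [round(v, 3) for v in compensation])
```

With a positive parameter the expected deviations keep the same sign and fade out; with a negative parameter they alternate in sign, which is the compensation pattern.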

Kuppens, Allen, and colleagues (2010) focused specifically on the relationship 
between emotional inertia (i.e., the autoregressive parameters of eight different emo- 
tions), and trait-like self-esteem. They concluded that higher self-esteem was associated 
with lower emotional inertia in self-reported diary data (although not for all emotions). 
In a second study, they looked at the inertia in observational data for depressed versus 
nondepressed adolescents interacting with their parents on diverse tasks. This study showed that
depressed adolescents had higher levels of inertia (for both positive and negative emo- 
tions) than nondepressed adolescents in conflict tasks and reminiscence tasks, but that 
there were no significant differences in average inertia between these groups on positive 
tasks. This shows that psychological states of depressed adolescents tend to be more 
pervasive than those of nondepressed adolescents, although this partly depends on the 
emotionally evocative nature of the interaction. 

Oravecz, Tuerlinckx, and Vandekerckhove (2009) published an excellent study in 
which they went well beyond the accomplishments discussed earlier in several important 
ways. First, their approach is based not on the standard autoregressive model but on the 
Ornstein-Uhlenbeck equation, which is the continuous time equivalent of an AR(1) pro- 
cess. In an AR(1) it is assumed that the intervals between successive observations are of 
equal length (e.g., a day or a second). However, many diary studies are based on random 
intervals between the measurements throughout the day, and such varying intervals dis- 
favor the use of AR(1) or other discrete time models. The Ornstein-Uhlenbeck equation 
is a model for continuous time, such that it can be used for data obtained with varying 
intervals. Second, the model developed by Oravecz et al. includes measurement error, 
such that the temporal process is modeled at the latent—rather than the observed—level: 
This is not through a factor model, as discussed earlier, but instead consists of separating 
signal and noise for a single variable, similar to the approach in a quasi-simplex model. 
Third, the model by Oravecz et al. allows for individual differences in all model param- 
eters; that is, the level, inertia, measurement error variance, and innovation variance are 
all included as random effects in the model. 
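
The link between the Ornstein-Uhlenbeck equation and the AR(1) process can be made concrete. The sketch below is not the hierarchical model of Oravecz and colleagues (it omits measurement error and random effects); it only assumes the standard univariate form dX = theta * (mu - X) dt + sigma dW, with hypothetical parameter values, and computes the autoregressive coefficient and innovation variance implied for an interval of arbitrary length.

```python
import math


def ou_transition(theta, sigma, delta):
    """Discrete-time AR(1) quantities implied by an Ornstein-Uhlenbeck process.

    Assumes dX = theta * (mu - X) dt + sigma dW with theta > 0.
    Returns (phi, innovation_variance) for an interval of length delta.
    """
    phi = math.exp(-theta * delta)
    stationary_var = sigma ** 2 / (2 * theta)
    innovation_var = stationary_var * (1 - phi ** 2)
    return phi, innovation_var


# Hypothetical gaps (in hours) between randomly timed diary prompts.
gaps = [0.5, 2.0, 1.0, 4.0]
for gap in gaps:
    phi, q = ou_transition(theta=0.5, sigma=1.0, delta=gap)
    print(f"gap={gap}h  phi={phi:.3f}  innovation variance={q:.3f}")
```

A single pair of continuous-time parameters thus yields a different effective autoregressive coefficient for every interval length, which is why unequal spacing poses no problem for this model.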

The empirical application in Oravecz and colleagues (2009) shows that neurotic peo- 
ple tend to have lower means and more variance on the unpleasant-pleasant dimension, 
while agreeable people tend to have less variability. No relationship between personality 
traits and valence of mood was found. More recently, Kuppens, Oravecz, and colleagues
(2010) used this model to study the relationship between temporal dependency and other 
personal dispositions. In their first study they found that inertia is related negatively to 
trait measures of reappraisal and positively to trait measures of rumination. This confirms 
the hypothesis about the maladaptive role of a high emotional inertia. In their second 
study—where the measurements were taken at smaller time intervals—the results were 
replicated for reappraisal but not for rumination. This led the authors to suggest that the 
regulation of emotions may be more important in the short run than in the long run. 

The models discussed here illustrate the great potential of considering time series
models as the building blocks in new and more advanced multilevel models: Such models 
provide a much richer description of psychological processes than the more common 
multilevel models, and they may prove to be invaluable for our understanding of certain 
phenomena associated with daily life. 


Discussion 


Almost half a century ago, Bereiter (1963) wrote in his opening chapter for the book 
Problems in Measuring Change: “Although it is commonplace for research to be stymied 
by some difficulty in experimental methodology, there are really not many instances in 
the behavioral sciences of promising questions going unresearched because of deficien- 
cies in statistical methodology. Questions dealing with psychological change may well 
constitute the most important exceptions” (p. 3). Since the introduction of latent growth 
curve modeling, this statistical methodological hurdle has been largely cleared with respect
to developmental processes. However, only very recently are we witnessing the emergence 
of solutions for the study of stationary processes. 

This timely and exciting breakthrough can be partly attributed to two recent techni- 
cal developments. First, the introduction of personal digital assistants, mobile phones, and 
online questionnaires has substantially eased the gathering of data associated with daily 
life. As a result, more and more researchers have datasets that consist of large amounts 
of repeated measurements from the same people. Second, the increased computational 
power of computers has allowed for new and more complicated multilevel models by 
which we can model such intensive longitudinal data. Specifically, there is a new develop- 
ment that targets investigating processes that take place at the within-person level, while 
simultaneously borrowing strength from other cases by combining these within-person 
models in novel multilevel models. 

With these technical developments in place, it is expected that in the next decade we 
will see an increase in the development and application of new multilevel models, which 
are much better tailored to the specific hypotheses that researchers want to investigate. 
Now it is up to the researchers to determine what questions they actually wish to inves- 
tigate, and to choose the appropriate statistical tools for answering these questions. For 
instance, one can ask not only whether children with more behavioral problems increase 
the level of rejection in their mothers, but also whether an increase in a child’s behav- 
ioral problems leads to an increase in a mother’s rejection: While the first question is a 
between-person question, the second one is a within-person question. Instead of trying to 
derive the answer for the second question by investigating the first, researchers can now 
choose the appropriate method to tackle their research question. 
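
That the two questions can have different answers is easy to show with a toy example. The numbers below are entirely hypothetical and were constructed so that the between-person and within-person relations have opposite signs; the sketch separates each score into a person mean and a deviation from that mean, then estimates a slope at each level.

```python
from statistics import mean


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Hypothetical scores: child problems (x) and maternal rejection (y),
# three occasions for each of three mother-child dyads.
data = {
    "dyad_a": {"x": [1, 2, 3], "y": [4, 3, 2]},
    "dyad_b": {"x": [4, 5, 6], "y": [7, 6, 5]},
    "dyad_c": {"x": [7, 8, 9], "y": [10, 9, 8]},
}

# Between-person relation: slope across the dyad means.
between_slope = slope([mean(d["x"]) for d in data.values()],
                      [mean(d["y"]) for d in data.values()])

# Within-person relation: slope across deviations from each dyad's own mean.
x_dev, y_dev = [], []
for d in data.values():
    x_dev += [x - mean(d["x"]) for x in d["x"]]
    y_dev += [y - mean(d["y"]) for y in d["y"]]
within_slope = slope(x_dev, y_dev)

print("between-person slope:", between_slope)
print("within-person slope:", within_slope)
```

In these constructed data, dyads with more problems on average show more rejection on average, yet within each dyad an increase in problems goes together with a decrease in rejection.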

Researchers have been asking within-person questions since the beginning of psy- 
chology, and for a long time they had no choice but to make the unlikely assumption 
that population results were reflective of within-person phenomena. Now we finally have 
access to statistical techniques specifically developed to tackle within-person questions. 
This implies that the statistical methodological barrier that Bereiter pointed out almost 
50 years ago is finally being overcome and we are on the threshold of an exciting and 
promising new stage in process research. 


Conducting Research in Daily Life


A Historical Review 


Pioneers of psychological science, such as Fechner, Wundt, and Ebbinghaus, laid the
foundation for developing psychology as a systematic experimental discipline. Notwithstanding
the methodological rigor and clarity inherent in experimentation, Wundt
himself was already aware of a shortcoming that led him to conclude that psychology
actually has at its disposal “two exact methods” (1896, p. 28): Of these, “the first, the
experimental method, subserves the analysis of more simple mental processes; the second, the
observation of mental products of general authenticity, subserves the study of higher 
mental processes and developments.” Wundt (1900-1920) therefore proposed the com- 
parative investigation of language, myths, and morality and their development over the 
lifespan in his Völkerpsychologie (literally “peoples’ psychology”). Today, Wundt would 
probably have chosen another approach to complement experimentation: the investiga- 
tion of naturally occurring mental processes and their behavioral expression under natu- 
ral conditions in the context of people’s daily lives. As we demonstrate, precursors of this 
research approach were already present during Wundt’s lifetime. 

In the remainder of this introduction, we briefly summarize four factors that drove
the development and dissemination of approaches to assess people’s experience and 
behavior in daily life, then we specify the focus of this chapter in more detail. 


Reasons for Conducting Research in Daily Life 


First, a strong driving force for conducting research in daily life has been the endeavor to 
increase ecological validity, usually understood to be the degree to which the results of 
a study can be generalized to people’s behavior in their real or everyday life. As pointed 
out by Hammond (1998), credit for coining the term goes to Egon Brunswik (1947, 
1955). However, in Brunswik’s theorizing, ecological validity had a different meaning. It 
referred to the probabilistic relationship between the organism and its environment in the
framework of his Lens model. What researchers call ecological validity today, Brunswik 
(1955, p. 205) labeled ecological generalizability. It can be achieved through representa- 
tive design, which means measures are repeatedly taken in natural situations that are 
randomly selected from the universe of situations. 

Expanding the methodological portfolio from pure laboratory experimentation into 
psychological field research was one way to achieve a higher degree of naturalness and 
thus ecological generalizability of research (Brunswik, 1943, 1955; Tunnell, 1977). The 
resulting need for assessment instruments applicable outside the stationary laboratory 
called for ambulatory assessment. While some of this work grew out of early applied 
research under environment protection policies (see Kaminski, 1976), other contributions 
came from basic research extending classical field theoretical notions (Lewin, 1943) of 
environment-behavior relationship and interaction (see Barker & Wright, 1951; Pawlik 
& Stapf, 1992). Obviously, laboratory and field (or ambulatory) research are complemen- 
tary rather than mutually exclusive, and the choice of the appropriate approach depends 
on the question under study. 

Second, as noted by Schwarz (Chapter 2, this volume), another motivation for research- 
ers to assess people’s experience and behavior in daily life has long been the concern with 
the validity of retrospective or generalized responses obtained with interviews or ques- 
tionnaires. Asking persons about their usual behavior or experience in a questionnaire 
or interview, in contrast to letting them observe and record it in real life, touches upon 
different data sources. Whereas questionnaire and interview methods are often biased by 
memory processes and cognitive heuristics, data captured in real time are less affected 
by such processes and distortions (Fahrenberg, Myrtek, Pawlik, & Perrez, 2007; Reis & 
Gabel, 2000; Schwarz, 2007 and Chapter 2, this volume). As we show in this chapter, 
capture of real-time psychological data today also goes far beyond Cattell’s (1957) early 
plea for the use of life-record data (e.g., educational records, accident rates, conviction 
data), interactive measures (e.g., physical appearance, private lodging) or “the human 
mind itself as a pattern-perceiving and -quantifying device in behavior rating” (p. 52). 

Third, there has been a growing interest in intraindividual variation of human 
behavior and experience across unrestrained real-life conditions. Studying individual dif- 
ferences exclusively in a cross-sectional manner, under standardized conditions, reduces 
important sources of variation, namely, the change of behavior across situations and 
time, and its interaction with trait characteristics. It also ignores that people often choose 
settings and situations differently, according to stable individual preferences (e.g., Bolger, 
Davis, & Rafaeli, 2003; Buse & Pawlik, 1996, 2001). Moreover, there has been a grow- 
ing interest in understanding how behavior and experiences unfold in time. 

Last, as our chapter and this entire handbook show, the development of many methods
of real-time data capture has been a fruit of the tremendous technological progress
over the last 60 years. 


Focus of the Current Chapter 


Our aim in the following sections is to highlight the origins and important develop- 
ments of major approaches for conducting research in daily life. Our focus is on different 
methods that have often been combined under the umbrella terms ecological momentary
assessment (EMA; Stone & Shiffman, 1994; Stone, Shiffman, Atienza, & Nebeling,
2007) or ambulatory assessment (Fahrenberg, 1996a; Fahrenberg et al., 2007). Charac- 
teristic of these methods is that people’s current experiences and behaviors are assessed 
repeatedly, in their natural environments, without or only with minimal latency. 

It would hardly be possible to give a comprehensive historical overview. Therefore, 
we have narrowed our focus to the development of three major approaches used to conduct
research in daily life: (1) diaries and related methods to record everyday experiences
and behaviors, (2) (psycho)physiological monitoring of heart activity, and (3) monitoring 
of physical activity and body movements. For these approaches, we have tried to cap- 
ture technological beginnings, the circumstances under which they were first used, and 
factors that led to their further refinement. For more recent stages of research, we give 
selected examples or refer to literature reviews. Due to space constraints, we shortened 
several paragraphs and omitted others. For a more detailed version of this historical
review, see Wilhelm and Perrez (2011). 


History of Self and Other Records in Daily Life 


Diaries as written records of current events and experiences have a long history, reaching 
back to antiquity. For example, the Meditations of Marcus Aurelius (121-180) are at least 
in part records of daily reflections. No fewer than five different functions of diaries can
be distinguished: (1) the recording of historically significant events (e.g., the journal of 
Arthur Bowes Smyth, 1787-1789, on Australian settlement); (2) the logging of incidents 
in seafaring expeditions (e.g., the journal of Christopher Columbus, 1492-1493); (3) the 
retention of private daily life experiences (e.g., the journal of Samuel Pepys, 1660-1669); 
(4) the logging of professional incidents, as common in psychotherapy; and (5) the record- 
ing of observations or experiences for scientific documentation and research. 

Charlotte Bühler (1922) and Gordon Allport (1942) proposed the analysis of private 
diaries to obtain insights into people’s functioning during daily life. However, in this sec- 
tion, we focus on scientific diaries and show how concerns about the validity of records 
led to an increase in structure and methodological rules. We begin this retrospective 
review with narrative observation diaries, which became popular more than 200 years 
ago, and end it with a review of experience sampling and contemporary electronic sys- 
tems for momentary assessments. 


Narrative Observation Diaries in Developmental Research 


At the end of the 18th century, scholarly fathers began to publish diary records about the 
development of their children (e.g., Tiedemann, 1787; see Wallace, Franklin, & Keegan, 
1994). In 1877, Darwin published a report on the development of sensory perception 
and emotion expression, based on diary notes he took 37 years earlier. During the first 
2 years of his child’s life, he “wrote down at once whatever was observed” (p. 285). In 
addition, he recorded responses to specific stimuli, which he presented to test the infant’s 
reactions. 

Investigating the development of language, the German physiologist William Thierry
Preyer (1882/1889) formulated a set of rules to which he adhered “strictly, without excep- 
tion” to achieve “the highest degree of trustworthiness” (p. 187). He demanded that the 
same child should be observed at least three times every day during the first 1,000 days
of his life. “Every observation must immediately be entered in writing in a diary that is 
always lying ready” (p. 187). In addition, “every artificial strain upon the child is to be 
avoided, and the effort is to be made as often as possible to observe without the child’s 
noticing the observer at all” (p. 187). Of course, it would be rather difficult to realize this 
scientific rigor in a contemporary research project. Later, developmental psychologists 
such as Clara and William Stern (1900-1918) and Jean Piaget wrote psychological dia- 
ries on the development of their children, too. Based on his diary records, Piaget (1926) 
generated his ideas on the developmental stage theory, which he then tested in experi- 
ments. 

Karl Bühler (1921) saw the strength of using unstructured observation diaries to 
gain an overview on phenomena in a new domain, which then inspires research ques- 
tions and hypotheses. Characteristic of these early observation diaries was that they were 
narrative, and there were no common methodological rules on how observations should 
be performed and recorded. In the 1920s and 1930s, researchers began to agree on such 
rules (Olson & Cunningham, 1934). Space does not permit us to outline the further 
development of observation methods. However, the refinement of techniques to observe 
people’s behavior in their daily life has been at least as diverse and sophisticated as that 
for self-recording methods (e.g., Arrington, 1943; Barker & Wright, 1951, 1955; Ben- 
son, 2010; Intille, 2007; Mehl, Pennebaker, Crow, Dabbs, & Price, 2001; Mehl & Rob- 
bins, Chapter 10, this volume; Rosander, Guterman, & McKeon, 1958; Webb, Campbell, 
Schwartz, & Sechrest, 1966). 


Early Self-Record Studies 


In medical case studies published in the early 20th century, patients sometimes reported 
their symptoms and medically relevant aspects of their behavior in diaries (e.g., Favill & 
Rennick, 1924, cited in Stone et al., 2007; Stone, 1908). With an increase in information 
gathered, the need to structure and standardize the records increased to facilitate analy- 
ses and to ensure comparability of records across situations and between persons. 

An example of an early self-record study is found with Hugo Münsterberg (1892), a
pioneer of applied psychology. Convinced that real-life affective states and emotions could 
not be adequately induced in the laboratory, he kept a diary over 9 months to record his 
current mental and affective state three to five times a day, together with the results of 
about 20 experimental tasks. In one task, he had trained himself to measure distances of 
10 and 20 cm with his fingers on a small instrument in his pocket. Analyzing his diary, 
Münsterberg identified three basic affective dimensions to describe the different qualities 
of feelings he had experienced. In addition, he found that he overestimated the distances 
measured with his fingers when he was feeling excited, and underestimated them when 
he was feeling depressed. Münsterberg (1908) suggested that researchers explore the util- 
ity of the finger measurement task as a diagnostic procedure to assess a witness’s true 
emotional state. 

Since the 1920s, diaries have been used increasingly in group studies. To investigate 
individual differences in sense of humor, Kambouropoulou (1926) asked 70 female stu- 
dents to write down the things they laughed about each day for 1 week. Based on qualita- 
tive and quantitative analyses of the humor diaries, Kambouropoulou found substantial 
consistency in the types of humor experienced across situations. 

Already in 1917, Flügel (1925) had reported what is, to our knowledge, the first
EMA study. Flügel had severe doubts about the reliability of retrospective reports of 
states of mind and therefore argued for momentary assessment. He trained friends and 
volunteers to apply the method of introspection in their daily lives and record their cur- 
rent affective state every hour for at least 30 days. They rated their degree of pleasure 
on a bipolar scale and gave a description of their current feelings. Based on data of nine 
participants, he found a predominance of pleasure over displeasure. However, individual 
differences were considerable, and those participants who experienced more extreme 
feelings were also less happy in general. 

In a multimethod feasibility study, Hunscher, Vincent, and Macy (1930) investigated 
the relation among physiological, psychological, and social factors, and the secretion of 
breast milk. For 3 years, three lactating women measured the quantity of the milk they 
produced and rated their mood on 12 bipolar rating scales every day. Moreover, a female 
observer lived 2 days each month (for 1 year) in their homes to weigh the food eaten by 
the women, and to note the women’s activities and emotional reactions. 

To investigate the physical, mental, and behavioral changes accompanying the men- 
strual cycle, McCance, Luff, and Widdowson (1937) asked 167 women to complete a 
structured daily diary for 6 months. They found that menstrual cycles varied with every 
woman, which contradicted results of previous research based on questionnaire meth- 
ods. Plotting results over days of the cycle, they could show that the intensity of sexual 
desire and incidence of sexual intercourse were associated with the menstrual cycle, as 
were pain in the back and belly; feelings of fatigue, depression, and irritability; and the 
tendency to cry. 


Diaries in Econometrics and Time Budget Research 


Already by the end of the 18th century, income and expenses of workers’ families were 
recorded in housekeeping books to study their material living conditions (Szalai, 1966). 
To this day, such studies are conducted by econometricians, with intensive methodological
research on expenditure diaries (Sudman & Ferber, 1971). 

At the beginning of the 20th century, interest grew in how people—especially those 
from the working class—spend their time (e.g., Bevans, 1913; see also Szalai, 1966). 
Unsatisfied with the validity of retrospective reports, Nelson (1933, as cited in Fox, 1934) 
designed a structured diary form to assess activities performed in half-hour units. Almost 
500 female employees completed the diary for 3 consecutive days. Nelson commented 
that although this would still be a “meager sample of any one individual’s time, ... it is 
probably a more true picture of the division of a day ... than innumerable questionnaires 
would or could ever elicit” (Nelson, 1933, p. 4, cited in Fox, 1934, p. 495). To investigate 
how gender, age, and socioeconomic factors influenced children’s leisure activities, Fox 
(1934) slightly adapted Nelson’s diary and gathered data from 372 boys and girls living 
in different suburban areas. Among other results, Fox found that girls had more home 
duties and therefore less leisure time than boys. Moreover, children living in the poorest 
suburb spent more time in paid work and work around the house but less time in leisure 
activities than did children living in the wealthier areas. Similar studies followed in dif- 
ferent countries (see Szalai, 1966). 

After World War II, 24-hour diaries to assess the timing and frequency of activities 
performed during a day were used in large population samples (e.g., 12,000 persons in a 
Hungarian microcensus in 1963; Szalai, 1966). This method was also used in the multi-
national comparative time budget project initiated in 1964 and supported by the United 
Nations Educational, Scientific and Cultural Organization (UNESCO), which coordi- 
nated research activities in 12 countries from the East and West (Szalai, 1966, 1972). 
Since 1985, statistical agencies in more than 15 countries have gathered national time use 
data with similar methods (Harvey, St. Croix, & Reage, 2003). 


Diaries in Health Research and the Investigation of Recall Errors 


In a review on health diaries, Verbrugge (1980) cited the Baltimore Morbidity Survey 
(Downes & Collins, 1940) as the first epidemiological study in which health diaries were 
used as a memory aid in monthly interviews among 1,796 families for 1 year. Although 
various researchers had raised doubts regarding the accuracy of retrospective reports, 
systematic research on this topic began in the 1950s. Allen, Breslow, Weissman, and 
Nisselson (1954) conducted a method study in which they compared interview and diary 
data obtained in 400 households. Interviewers asked respondents, who had kept an ill- 
ness diary for the last month, whether an illness had occurred during the last 30 days. 
For medically attended illnesses, differences between diary and interview estimates were 
small, but for medically unattended illnesses, results differed substantially. Allen and col- 
leagues concluded that the differences between the two methods were “clearly related to 
errors of memory” (p. 922). Their results were confirmed in other health studies in the 
1960s and 1970s (Verbrugge, 1980). 

However, Allen and colleagues (1954) were not the first to investigate recall errors 
in remembering daily life events. Already in 1930, Thomson had investigated whether 
people tended to forget the unpleasant events faster than the pleasant events, which was a 
hypothesis derived from Freudian theory. Among other tasks, the 29 participants kept a 
diary for 5 days and recorded pleasant and unpleasant events they had experienced. Two 
weeks later, they remembered about one-third of both the pleasant and unpleasant events 
they had recorded. After 4 weeks, they remembered one-fourth of the pleasant but only 
one-sixth of the unpleasant events. 

This early work laid the foundation for subsequent research into the differences 
between momentary and retrospective reports that have been intensively investigated in 
experiments and ecological momentary assessment studies (Fahrenberg, Hüttner, & Leonhart,
2001; Schwarz, 2007, and Chapter 2, this volume). The observation that momentary
assessment leads to less distorted information by avoiding recall biases (and minimizing 
naive theories and heuristic processing) has become a major legitimization for the use of 
these methods (e.g., Fahrenberg et al., 2007; Trull & Ebner-Priemer, 2009). 


Self-Monitoring in Behavior Therapy 


In the 1960s, the influence of behaviorism declined, and cognitive approaches in dif- 
ferent areas of psychology, including psychotherapy, became popular. In the context of 
cognitive-behavioral therapy, disorder-specific diaries and self-observation procedures 
have been developed for applied and research purposes since the late 1970s. Patients in 
research and practice are often instructed to record a target behavior whenever it occurs 
(following an event-contingent protocol). In addition, the situational circumstances, 
accompanying cognitions (e.g., dysfunctional thoughts and self-instruction), and feelings 
are recorded to obtain information about situational antecedents and consequences of
the behavior (e.g., Beck, Rush, Shaw, & Emery, 1979; for a review, see Thiele, Laireiter, 
& Baumann, 2002). 

In a clinical diary study conducted in the early 1970s, Lewinsohn and Libet (1972) 
had patients and controls rate their positive activities and mood at the end of each day 
over a period of 1 month. They found that both measures correlated within individuals 
across time, which they interpreted as a confirmation of the theoretical claim that the 
intensity of depression is due to a lack of positive reinforcement. Lewinsohn and col- 
leagues replicated this finding in later studies. 

Clinical psychologists have also had a strong interest in using self-monitoring to 
facilitate desired behavior changes. Confirming results of field studies (e.g., McFall, 1970; 
McFall & Hammen, 1971), Kazdin (1974) found in a series of experiments that the act 
of self-monitoring alone changed behavior (thus producing reactivity), but the valence of 
the target behavior determined the direction of change. (For a more detailed discussion of 
research on reactivity, see Barta, Tennen, and Litt, Chapter 6, this volume.) 


Random Sampling of Moments and Other Designs 


In the field of psychology, the idea of sampling situations at random moments goes back to
Egon Brunswik (1943, 1955). When situations (or stimuli) are sampled at random from a
universe or, in Brunswik’s terms, an ecology of situations, they are representative of this
universe and thus ecologically generalizable. However, as early as the 1930s the British
statistician Leonard Henry Caleb Tippett had introduced the Snap Reading Method in
industry: Machine operators in a textile factory were observed at random intervals in order to
estimate the percentage of time machines were idle. Under the label work sampling, this 
method was then applied to investigate workers’ and employees’ activities with the aim of 
enhancing efficiency (Lorents, 1971; Rosander et al., 1958). The work sampling method 
is, however, less suited for the evaluation of higher level jobs, because externally observ- 
able behaviors are rather less relevant. Therefore, organizational psychologists began to 
let employees self-record their activities at random times. 
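
The logic of work sampling can be sketched in a few lines: if a machine is observed at n randomly chosen moments and found idle at k of them, the ratio k/n estimates the share of time it is idle. The following is only a minimal illustration, not Tippett's original procedure. The function name, the simulated machine, and the normal-approximation standard error are assumptions made for this example.

```python
import math
import random


def estimate_idle_share(is_idle_at, shift_minutes, n_observations, rng):
    """Estimate the share of time a machine is idle from random snapshots.

    is_idle_at: hypothetical callable, minute -> True if the machine is idle.
    Returns (estimated share, approximate standard error).
    """
    # Draw observation moments uniformly at random over the shift.
    moments = [rng.uniform(0, shift_minutes) for _ in range(n_observations)]
    idle_count = sum(1 for t in moments if is_idle_at(t))
    share = idle_count / n_observations
    # Binomial standard error (normal approximation, an assumption here).
    se = math.sqrt(share * (1 - share) / n_observations)
    return share, se


# Illustrative machine (invented data): idle during the first 2 hours of
# an 8-hour shift, so the true idle share is 0.25.
def machine_idle(minute):
    return minute < 120


share, se = estimate_idle_share(machine_idle, 480, 2000, random.Random(1))
print(f"estimated idle share: {share:.3f} (SE {se:.3f})")
```

The estimate becomes more precise as the number of random observations grows, which is why the method needs no continuous observation of the machines.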

Of the six randomly signaled self-record studies reviewed by Lorents (1971), the first
was published in 1964 by Hinrichs, who investigated communication activities of 192
technical employees. Each employee was given a randomly determined time schedule and 
an alarm watch. A couple of years later, pocket-size devices to give random alarms were 
used to investigate university faculty members’ activities (Lorents, 1971), as well as ado- 
lescents’ or patients’ spontaneous thoughts (e.g., Hurlburt & Sipprelle, 1978; Klingler, 
1971; see Hurlburt & Heavey, 2004, for a review). 

Parallel to these applications, Csikszentmihalyi and colleagues developed a similar 
approach, which has become famous under the label experience sampling method (ESM). 
Interested in “flow” experience, they first asked participants to keep diaries and write 
down in the evening what they had done and what they had enjoyed most during the 
day. However, participants’ reports seemed to be written according to scripts and did not 
show much discrimination (Hektner, Schmidt, & Csikszentmihalyi, 2007, p. 7). Thus, 
looking for a more satisfying approach, they provided their participants with beepers of the
kind then available to medical doctors. Participants were asked to answer questions whenever
the beeper, which was activated at random intervals, gave a signal.

Csikszentmihalyi, Larson, and Prescott (1977) first used this method to investigate 
25 adolescents’ everyday experiences and activities, beeping them five to seven times a 
day for 1 week. They found that adolescents spent most of the time talking with peers or
watching TV, but they felt better when talking with peers. In later studies, momentary 
sampling schedules have been varied. For example, investigating the consistency of feel- 
ings across time and situations, Diener and Larsen (1984) used a programmable wrist 
watch to signal students twice a day for 6 weeks to complete a mood scale. 

Since its development, ESM has been used in hundreds of studies with adolescents,
students, employees, couples, families, and psychiatric patients, from different social and 
cultural backgrounds, to investigate daily life activities and experiences and how they 
unfold in time (e.g., De Vries, 1992; Hektner et al., 2007; Scollon, Kim-Prieto, & Diener, 
2003). 

Subsequently, ESM came to be recognized as one of three main strategies to sample 
experiences and behaviors. Wheeler and Reis (1991), who acknowledge Klingler (1971) 
for this classification, distinguished (1) signal-contingent sampling—when records are 
completed at fixed or random times indicated by a signal device (the latter comprises 
ESM); (2) interval-contingent sampling—when records are completed at predetermined 
times and usually at regular intervals (as with daily diaries; e.g., Kambouropoulou [1926] 
and Flügel [1925], who had participants complete diaries every evening or every other 
hour, respectively, or Mehl et al. [2001], who recorded snippets of ambient sounds every 
12 minutes); and (3) event-contingent sampling—when records are completed after a 
predefined event occurs (self-monitoring in psychotherapy, Myrtek’s [2004] psychophysi- 
ological monitoring system that triggers records when nonmetabolic heart rate increases, 
or Münsterberg’s [1908, p. 71] recording and testing “whenever daily life brought [him]
into a characteristic mental state, such as emotion or interest or fatigue or anything 
important to the psychologist”; for a review, see Moskowitz and Sadikaj, Chapter 9, this 
handbook). For a further discussion of ambulatory assessment designs, see Bolger and 
colleagues (2003), Reis and Gable (2000), Shiffman (2007), and Conner and Lehman, 
Chapter 5, and Reis, Chapter 1, this volume. 
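
The difference between signal-contingent and interval-contingent sampling can be illustrated with a short schedule generator. This is a hedged sketch, not a procedure from any of the cited studies: the function names, the minimum gap between signals, and the example times are assumptions. Event-contingent sampling is not scheduled in advance, because records are triggered by the events themselves, so it is omitted here.

```python
import random


def random_signal_schedule(start_min, end_min, n_signals, min_gap, rng):
    """Signal-contingent schedule with random signals.

    Returns n_signals times (minutes after midnight) between start_min and
    end_min that are at least min_gap minutes apart.
    """
    span = end_min - start_min - (n_signals - 1) * min_gap
    if span < 0:
        raise ValueError("window too short for the requested signals")
    # Draw sorted offsets in the reduced window, then re-insert the gaps.
    offsets = sorted(rng.uniform(0, span) for _ in range(n_signals))
    return [start_min + o + i * min_gap for i, o in enumerate(offsets)]


def interval_schedule(start_min, end_min, every_min):
    """Interval-contingent schedule: fixed, predetermined times."""
    return list(range(start_min, end_min + 1, every_min))


# Six random signals between 08:00 and 22:00, at least 30 minutes apart.
signals = random_signal_schedule(480, 1320, 6, 30, random.Random(0))
# One record every other hour over the same window.
fixed = interval_schedule(480, 1320, 120)
```

With random signals, participants cannot anticipate the next record, whereas the fixed schedule is known in advance.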


Electronic Devices to Record Experience and Behavior 
and to Control Compliance 


Up to the 1980s, experiences and behaviors were recorded through paper-and-pencil 
questionnaires. However, in the 1980s, various research groups began to use portable 
computers to improve the quality of data assessed in daily life. Pawlik and Buse (1982) 
were the first to develop a computer-assisted system for ambulatory psychological assess- 
ment in cooperation with a hardware company. In later generations of devices, Buse and 
Pawlik (1996, 2001) added further psychometric tests to obtain objective measures of cur- 
rent mental activation and performance outside the laboratory. In addition, finger pulse 
could be measured. In a series of early studies, they assessed high school students’ mood, 
settings, activities, and test performance to investigate the degree of cross-situational 
consistency and personality differences under field conditions (for an overview, see Buse 
& Pawlik, 1996, 2001). 

To obtain event samples of stress-related affective states and the corresponding cop- 
ing behavior under real-life conditions, Perrez and Reicherts (1987, 1996) developed a 
computer-assisted recording system (COMRES) using the first generation of handheld 
computers. Synchronized computer-assisted diaries were then used to study emotional 
transmission, empathic inference, and stress and coping in couples’ and families’ daily 
living (Perrez, Schoebi, & Wilhelm, 2000; Schoebi, 2008; Wilhelm & Perrez, 2004). 
Other computer-assisted diary applications were developed to study phenomena such
as anxiety (Taylor, Fried, & Kenardy, 1990), smoking and drinking behavior (Shiffman 
et al., 1994), symptoms of asthma (Hyland, Kenyon, Allen, & Howarth, 1993), and cur- 
rent contextual information relevant for psychophysiological studies (e.g., Fahrenberg, 
1996b, 2001). 

For such early applications, computer programmers had to write the software, which 
was time-consuming and very costly. Therefore, the next generation of technology was 
advanced by flexible software packages that allowed researchers to design their applica- 
tions on a handheld device without additional support of programmers (e.g., MONI- 
TOR: Fahrenberg et al., 2001; the Experience Sampling Program [ESP]: Feldman Barrett 
& Barrett, 2001; for recent reviews, see Ebner-Priemer & Kubiak, 2007; Kubiak & Krog, 
Chapter 7, this volume). 

The introduction of handhelds has had many advantages (e.g., branching of questions 
is easily possible, records have an electronic data format, signal options are available, and 
response times can be stored). Another useful feature is the ability to verify participants’
compliance with timing. Compliance has been an issue of concern since the early use of
paper diaries, because any delay in response might introduce recall errors, thus impairing
the quality of the data. When handhelds (or other electronic devices) are used, researchers
can check whether participants adhere to the instructions and answer questions when
they are supposed to (e.g., after receiving a signal).

Several studies have investigated whether the type of record used (paper vs. elec- 
tronic) affects compliance (e.g., Hyland et al., 1993). For example, Hank and Schwen- 
kmezger (1996) found that substantially more records were completed when the paper 
version was used. They concluded that participants using paper diaries achieved their apparently
higher compliance by disregarding timing instructions (e.g., through hoarding and back-filling
records). This conclusion was also supported by a well-conducted experiment by Stone, Shiffman,
Schwartz, Broderick, and Hufford (2002), who assigned 80 patients with chronic
pain to complete 20 questions three times a day for 3 weeks using either an electronic 
diary or a paper diary. Unbeknownst to participants, the paper diaries were fitted with 
photosensors that unobtrusively recorded when a particular page was opened and closed. 
Actual compliance (less than a 30-minute delay) was 94% when an electronic diary was 
used. In contrast, participants using the paper diary reported 90% compliance, but their 
actual compliance, as measured by the photosensor, was only 11%, and hoarding and 
back-filling of answers was common. 
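
The distinction between reported and actual compliance can be made concrete with a small calculation. The sketch below assumes that an entry counts as compliant when it is completed less than 30 minutes after its scheduled time, following the criterion mentioned above. The function name and the example times are invented for illustration and are not data from the study.

```python
def compliance_rate(scheduled, completed, window_min=30):
    """Share of scheduled entries completed within window_min minutes.

    scheduled: scheduled times in minutes after midnight.
    completed: actual completion times, or None if an entry was never made.
    Entries made before the scheduled time (hoarding) or too late
    (back-filling) count as noncompliant.
    """
    if not scheduled:
        raise ValueError("no scheduled entries")
    on_time = sum(
        1
        for due, done in zip(scheduled, completed)
        if done is not None and 0 <= done - due < window_min
    )
    return on_time / len(scheduled)


# Hypothetical day with entries due at 10:00, 16:00, and 22:00.
due_times = [600, 960, 1320]
reported_times = [600, 960, 1320]      # times written into the paper diary
verified_times = [1400, 1400, 1325]    # times an unobtrusive sensor might log

reported = compliance_rate(due_times, reported_times)
verified = compliance_rate(due_times, verified_times)
```

In this invented example the diary looks fully compliant, while the verified times show that only the last entry was made on time.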

To qualify Stone and colleagues’ (2002) conclusion that the validity of paper diaries
is in question, Green, Rafaeli, Bolger, Shrout, and Reis (2006) presented data from three 
studies to show that, under certain conditions, results gained with both methods were not 
substantially different. Although this led to a further debate (see Bolger, Shrout, Green, 
Rafaeli, & Reis, 2006), the use of electronic devices to assess self-records in daily life has 
become the standard for most applications. 


Ambulatory Electronic Systems to Modify Behavior 


In clinical and health psychology, handhelds have been applied to support behavior mod- 
ification since the 1980s (e.g., treatment of obesity: Burnett, Taylor, & Agras, 1985; 
anxiety disorders: Newman, Kenardy, Herman, & Taylor, 1997). Based on the current 
information entered by the patient, systems were programmed to provide useful basic 
information, give immediate feedback, suggest exercises, or remind the user about more 
functional coping strategies. More recent approaches have used mobile phone systems in
combination with Internet platforms (Carter, Day, Cinciripini, & Wetter, 2007; Sorbi, 
Mak, Houtveen, Kleiboer, & van Doornen, 2007). Moreover, various systems have been 
developed, for example, to give children with attention-deficit/hyperactivity disorder 
feedback on their current motor activity (e.g., Schulman, Stevens, & Kupst, 1977; Tryon, 
Tryon, Kazlausky, Gruen, & Swanson, 2006), or to change breathing patterns in patients 
with anxiety disorders (e.g., Meuret, Wilhelm, Ritz, & Roth, 2008). 


History of Ambulatory Physiological Monitoring
of Cardiac Activity 


In this section, we focus on ambulatory monitoring of cardiac activity, and heart
rate (HR) in particular, because it is a major indicator of ergonomic and psychological
demands, which have long been measured in daily life. However, many other physiological
measures have been assessed in daily life, too, such as blood pressure, skin conductance,
respiration, electromyography (EMG), electroencephalography (EEG), and hormones,
which we cannot cover in this chapter (see Pickering, 1991; Schlotz, Chapter
11, and F. Wilhelm, Grossman, & Müller, Chapter 12, this volume). 


The Dawn of Ambulatory Physiological Monitoring 


Unlike momentary reports of current experience and behavior, which can be achieved 
with paper-and-pencil questionnaires, the assessment of physiological reactions requires 
sophisticated technical equipment. Such equipment was developed in the 1940s (e.g., 
Fuller & Gordon, 1948). Wearing a backpack radio transmitter weighing 85 pounds, 
Holter and Gengerelli (1949, cited in Kennedy, 2006) could record the first electrocar- 
diogram (ECG) during free exercise. After the invention of the transistor, devices became 
more powerful and much smaller. By the end of the 1950s, parallel radio transmission of 
10 channels was already possible with compact and lightweight devices (< 1 kg; Dunn & 
Beenken, 1959). Although the inventors of biotelemetric systems immediately recognized 
their potential to investigate unrestrained human and animal behavior under the “physi- 
cal and emotional conditions encountered in the stress and strain of daily living” (Holter, 
1957, p. 913), the use of biotelemetric systems was limited to studies of aircraft and space 
pilots until the end of the 1950s. 


Biotelemetry in Space Traveling 


At the same time, physiological monitoring began to play an essential role in military and 
space research. Biotelemetric systems were developed to monitor aircraft pilots and to 
investigate the possibility of manned space flights (Barr & Voas, 1960; Beischer & Fregly, 
1962). As early as the summer of 1948, U.S. scientists began to launch high-altitude
rockets carrying monkeys. They continuously monitored and transmitted the monkeys’ HR and
respiration to the control center to investigate the bodily effects of high acceleration, 
low gravity, and cosmic radiation during rocket flights. Data transmitted showed that 
the monkeys were alive until the rocket shattered on the ground, and that physiological 
systems did not exceed their natural limits (Henry et al., 1952, as cited in Beischer & 
Fregly, 1962). 
A comparable research program began in 1949 in the former Soviet Union (Beischer
& Fregly, 1962). In November 1957, Soviet scientists and engineers sent the first living 
being—a dog called Laika—into orbit. Laika’s blood pressure, ECG, respiration rate, 
and motor activity were monitored and transmitted to earth (Sisakian, 1962, as cited in 
West, 2001). Further successful missions with dogs followed, suggesting that manned 
space flight might be possible (Gazenko, Ilin, & Parfenov, 1974). In April 1961, the Soviets
launched the first man—cosmonaut Yuri Gagarin—into orbit. During his 1 hour, 48 
minute journey around the earth, he was televised, and his ECG and respiration were 
recorded and transmitted to earth. Gagarin’s HR increased from 65 beats per minute 4 
hours before launch to 108 beats 5 minutes before launch, indicating emotional arousal 
due to the psychological stress associated with the impending space flight. It further rose 
to over 150 beats per minute after launch, when he was exposed to vibration, noise, and 
increasing acceleration, but decreased to below 100 beats after reaching orbit (Volynkin 
et al., 1962, as cited in West, 2001). When, 8 years later, in July 1969, Neil Armstrong 
landed on the moon, his HR rose to a similar extent, reaching 150 beats per minute, prob- 
ably due to the excitement and stress he was experiencing at that very moment (Simonov, 
1975). 

Thus, the physiological and behavioral monitoring that was an essential part of the 
early space missions stimulated the development of biotelemetry (see Bayevskiy & Adey, 
1975). It was necessary to supervise the physical well-being of early space travelers in 
their training and preparation, and during their space missions. “Without the continuous 
monitoring of the physiological conditions of the astronaut, the margin of safety for his 
life would not be adequate to justify the risk involved” (Mayo-Wells, 1963, p. 512). After 
flight, environmental, biological, physiological, behavioral, and performance data were 
rigorously examined to understand better the effects of environmental changes during 
space travel on the physiological and psychological functioning of the astronauts and 
cosmonauts (Mills Link, 1965). 


Early Ambulatory ECG Monitoring Studies 


To overcome some limitations of biotelemetric systems such as radio interference, short 
assessment duration, and restricted range of transmission, Holter (1961) and colleagues 
developed an ambulatory ECG-recorder (1 kg weight) with a storage capacity of 10 hours. 
In addition, they offered a system for fast data screening of the extensive records (about
50,000 heartbeats in 10 hours; Holter, 1961). 

Thus, at the beginning of the 1960s, different portable ECG-monitoring systems 
were available, and cardiologists began to monitor the ECGs of patients over longer 
time periods, during unrestricted physical exercise, and during their usual daily activities,
to investigate arrhythmias, conduction disturbances, ischemia, and anginal pain (e.g.,
Bellet, Eliakim, Deliyiannis, & Lavan, 1962). Such studies showed that certain patho- 
logical changes in cardiac activity (e.g., transient arrhythmias, ischemic episodes) were 
not visible in a standard laboratory ECG but did occur during patients’ daily activities 
(e.g., Cerkez, Steward, & Manning, 1965). In addition, physiological effects of particular 
activities performed under different environmental conditions, such as climbing or skiing 
at high altitude, were investigated (Pollit, Almond, & Logue, 1968; Sanders & Martt, 
1966). Moreover, zoologists radio-monitored ECGs of swimming and diving humans, 
mammals, and fish (Baldwin, 1965), and the ECGs, respiration rates, tidal air flows, and 
minute ventilations of flying pigeons (Hart & Roy, 1966). 
Other researchers investigated physiological effects of real-life stressors. Concerned
with car accidents caused by myocardial infarction, German traffic physiologist Hoff- 
mann (1962, 1965) used radio telemetry to monitor ECGs of 600 healthy persons and 
several patients with coronary heart disease while they drove a test car. Just driving a
car, even in low-density rural traffic, increased HR in most drivers due to the physical 
and mental workload associated with driving. HR increases were stronger in city traffic 
but strongest in critical situations (e.g., sudden stops, overtaking another car) in which 
most pathologically relevant ECG changes occurred. Similar studies followed in which 
the ECGs of patients and healthy controls were monitored while they drove their own 
cars in normal traffic (Bellet, Roman, Kostis, & Slater, 1968) or busy city traffic (Taggart, 
Gibbons, & Somerville, 1969). HR increased during driving in patients and in healthy 
subjects, but manifestations of myocardial ischemia were more frequent in patients. Tag- 
gart and colleagues (1969) also monitored 10 racing drivers and found their HRs to be 
much higher than those of the city drivers. 

To investigate physiological habituation, Fenz and Epstein (1967) compared skin 
conductance, as well as heart and respiration rate, of novice and experienced parachutists 
during a series of events leading up to a jump. In novice parachutists, physiological arousal 
increased steadily until they jumped, whereas in experienced jumpers, an initial increase 
was followed by a slight decrease before jumping. According to Fenz and Epstein, these 
results suggested that novice parachutists, in contrast to more experienced ones, had not 
developed effective inhibitory control mechanisms to master their fear reactions. 

In an early ambulatory study on work-related stressors, Ira, Whalen, and Bogdonoff 
(1963) monitored physicians and medical students in situations perceived as stressful, 
such as staff conferences, ward rounds presentations, and cardiac catheterizations. In 
these situations, they observed strong HR increases that were not noticed by the partici- 
pants. Many studies to assess the cardiac effects of work in different professions followed 
(for a brief overview, see Fahrenberg & Wientjes, 2000; Payne & Rick, 1986). 


The Need to Control for Physical Activity 


When HR of freely moving persons is monitored, as in the study of Ira et al. (1963), it is 
difficult to interpret an HR increase during a particular task or activity as an indicator of 
psychological stress, because such an increase can also be due to an increase in metabolic 
demands caused by physical activity (see F. Wilhelm et al., Chapter 12, this volume). 
Therefore, adequate measurement and control of metabolic demands became essential 
in later studies. Investigating the stress of surgeons during an operation, Becker, Ellis, 
Goldsmith, and Kaye (1983) measured oxygen consumption as an indicator of metabolic 
demands. In fact, HR increases during the operation could largely be explained by meta- 
bolic demands. Payne and Rick (1986) found HR of surgeons during an operation to be 
higher than HR of anesthetists, but they attributed this difference to the differing
levels of physical activity they observed in the two professions.

To control metabolic demands in people’s daily lives, body movements have been 
monitored and taken into account for the interpretation of physiological changes. In 
an early psychophysiological study over 24 hours, Roth, Tinklenberg, Doyle, Horvath, 
and Kopell (1976) recorded ECG and movements of the arm and leg, in addition to self- 
ratings of physical activity and concurrent mood. They then descriptively compared HR 
between situations in which physical activity was similar. Taylor and colleagues (1982) 
proposed an offline algorithm that categorized HR changes in conjunction with physical 
activity (measured with a motion sensor attached to the thigh) to identify automatically
episodes of physical exercise, usual physical activity, or “anxiety” (elevated HR but low 
physical activity). 

Some years later, Myrtek and colleagues (1988) developed an interactive ambulatory 
monitoring system to detect nonmetabolic HR increases (additional HR), which indicate 
emotional arousal or mental workload (see also Myrtek, 2004). The system monitored 
ECG and physical activity (assessed with accelerometers on the chest and the thigh) over 
23 hours. During the monitoring, an online algorithm compared HR and physical activ- 
ity. An increase was assumed to indicate emotional arousal when the HR at a given 
minute was substantially higher than the average HR during the preceding 3 minutes, 
but physical activity increased only slightly or not at all. When the system detected such 
an HR increase, it emitted an acoustical signal. Participants then briefly recorded their 
current emotional state, activity, and setting. In addition, the system generated random 
signals. Thus, participants answered questions presented by the interactive recorder every 
10 to 20 minutes, without knowing whether the questioning was triggered randomly or 
by their physiological arousal. Studying more than 1,300 individuals (students, school- 
children, workers, and patients) Myrtek and colleagues found, for example, that although 
emotionally triggered HR increases were frequent, participants did not usually recognize 
them, which indicates that perception of physiological arousal during daily life is rather 
poor (Myrtek, 2004; Myrtek, Aschenbrenner, & Brügner, 2005). 
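
The detection rule described above can be sketched as follows. This is a simplified illustration, not the algorithm implemented in the original monitoring system. The text specifies only that HR must be "substantially higher" than the mean of the preceding 3 minutes while physical activity rises "only slightly or not at all", so the numerical thresholds, the function name, and the example data are assumptions.

```python
def detect_additional_hr(hr, activity, hr_rise=3.0, max_activity_rise=10.0,
                         window=3):
    """Flag minutes with a nonmetabolic ("additional") HR increase.

    hr, activity: per-minute heart rate (bpm) and movement scores.
    A minute is flagged when HR exceeds the mean of the preceding `window`
    minutes by at least hr_rise while activity rose by at most
    max_activity_rise. Both thresholds are illustrative assumptions.
    """
    if len(hr) != len(activity):
        raise ValueError("hr and activity must have the same length")
    flagged = []
    for t in range(window, len(hr)):
        hr_base = sum(hr[t - window:t]) / window
        act_base = sum(activity[t - window:t]) / window
        hr_up = hr[t] - hr_base >= hr_rise
        still = activity[t] - act_base <= max_activity_rise
        if hr_up and still:
            flagged.append(t)  # here the recorder would emit a signal
    return flagged


# Invented example: minute 3 shows an HR rise without movement,
# minute 7 shows an HR rise that accompanies strong movement.
hr_series = [70, 71, 70, 80, 72, 71, 70, 95]
activity_series = [5, 5, 5, 6, 5, 5, 5, 60]
print(detect_additional_hr(hr_series, activity_series))  # prints [3]
```

Only minute 3 is flagged, because the larger HR increase at minute 7 coincides with a marked increase in movement and is therefore treated as metabolic.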


Ambulatory Psychophysiological Monitoring of Patients 
with Mental Disorders 


In the 1980s, Roth, Taylor, and colleagues from the Stanford School of Medicine began 
to investigate anxiety disorders in patients’ daily lives. Initially, they were interested in 
panic disorder, which is characterized by a sudden onset of intense fear and is therefore 
difficult to study under laboratory conditions. In a pioneering study, Taylor, Telch, and
Havvik (1983) monitored HR and physical activity of 10 panic patients over 24 hours. 
They reported eight attacks, of which three were associated with increased HR that was 
not due to physical activity. Monitoring ECG for 3 days, Margraf, Taylor, Ehlers, Roth, 
and Agras (1987) found that HR increased when panic was experienced during feared 
situations or in anticipation of such situations, but not during spontaneously occurring 
attacks. In summary, such studies revealed that, in contrast to patients’ self-reports, panic
attacks usually occur with only moderate physiological changes or none at all (Alpers,
2009). Moreover, they provided evidence against etiological hypotheses, which propose 
that panic episodes are due to cardiovascular pathology or peripheral hyperresponsive- 
ness. 

In later studies, ECG, skin conductance, respiration, and other measures were mon- 
itored during in vivo exposure of patients with driving or flight phobia. In contrast to 
healthy controls, patients reported more anxiety and higher physiological arousal while 
driving a car (Alpers, Wilhelm, & Roth, 2005; Sartory, Roth, & Kopell, 1992) or fly- 
ing by plane (Wilhelm & Roth, 1998). As expected, anxiety and physiological arousal 
decreased during a second exposure, indicating successful habituation. However, no 
habituation occurred in patients who took benzodiazepines during the first exposure, 
suggesting that anxiolytic treatment hinders effects of exposure (Wilhelm & Roth, 
1997). Meanwhile, ambulatory psychophysiological monitoring has also been conducted
with other diagnostic groups (e.g., depressed and borderline patients; see Ebner-Priemer 
& Trull, 2009). 


History of Monitoring Body Movements 
and Physical Activity in Daily Life 


Many methods and devices have been developed for assessing body movements in people’s 
daily lives (for an overview, see Bussmann & Ebner-Priemer, Chapter 13, this volume; 
Bussmann, Ebner-Priemer, & Fahrenberg, 2009; Tryon, 1991). From the data collected 
with these methods, important features of behavior in everyday life can be inferred. In 
the following sections, we briefly review the origin and development of research based on 
step counters (pedometers) and other devices to assess movements of the whole body or 
of particular parts of the body. 


Counting Steps and Estimating Distances Walked 


The first blueprint for a mechanical pedometer to count pendulum movements of the 
thigh was devised by Leonardo da Vinci (1452-1519). One hundred years later, Anselmus 
Boeti de Boodt (1550-1632), physician and curator of gems of Rudolph II in Prague, 
described a pedometer combined with a compass to record on paper the distance and 
direction walked (Hoff & Geddes, 1962). Until the 20th century, geographers and 
explorers used pedometers to measure distances walked, thereby improving the precision 
of route descriptions and maps. 

In human research, mechanical pedometers were probably the first instruments 
that allowed automatic recording of people’s daily life activity. As early as 1882, Hodges
reported that during a workday in a hospital, nurses and other health professionals
walked two to four times the distance of women who spent the day at home, as
measured by a pedometer. Based on observations and pedometer measures, Thompson 
(1913) later highlighted unfavorable and inefficient work conditions (e.g., long wards) 
that required nurses to walk too much. 

Pedometers played a role in a variety of early 20th-century research. For exam- 
ple, physicians used pedometers in case studies to demonstrate physical improvement of 
patients after a new pharmacological treatment (e.g., Fordyce, 1914). Sociologist Cur- 
tis (1914) suggested measuring children’s daily activity (by pedometer) to investigate 
how environmental conditions—in particular, availability of attractive playgrounds— 
affected their activity levels. From his own observations and single pedometer measures, 
he hypothesized that schoolchildren living in environments without adequate provision 
for play would move 2-3 miles less per day. Billings (1934) measured the activity of six 
female psychiatric patients throughout their menstrual cycles. He found a postmenstrual 
peak in women’s activity, which continuously declined until the next period. About 40 
years later, Morris and Udry (1969) reported that women taking contraceptive pills were 
less active, as measured by the pedometer, than women who did not take contraceptive 
pills. 

New insights have also been gained by self-studies. One remarkable self-study was 
by a Swiss doctor who counted his steps with a pedometer over 1 year. He found
that he had walked 9,760,900 steps in total (“A Doctor’s Footsteps,” 1894). Wearing a 
pedometer during winter while performing mostly sedentary activities (writing), physiologist
Benedict (1910) recognized that he still had been walking about 7 miles per day,
much more than he had expected. Based on a similar experience, when measuring the distances
walked during his work in the hospital, German physician Lauter (1926) concluded
that the extent of physical activity cannot be accurately assessed by mere self-report. This 
conclusion was confirmed in later studies (Bussmann & Ebner-Priemer, Chapter 13, this 
volume). 

Investigating the etiology of obesity, Lauter (1926) claimed that overweight is the 
result of increased caloric intake in conjunction with lower energy expenditure due 
to reduced physical activity. Although Lauter did not assess physical activity in obese 
patients, his reasoning was later confirmed in a pedometer study by Larsen (1949; as cited 
in Chirico & Stunkard, 1960), who showed that hospitalized obese patients walked less 
than hospitalized nonobese patients. Chirico and Stunkard replicated the results in two 
studies with obese and nonobese women and men wearing individually calibrated pedom- 
eters for 1 or 2 weeks, respectively. In addition, Stunkard (1958) took daily pedometer 
measures of eight obese psychotherapy patients over several months. He reported examples
of short- and long-term associations between daily physical activity and patients’
mood (assessed in therapeutic sessions), showing that bad or depressed mood was associ- 
ated with a decrease in activity. 

More than 100 years ago, Kisch (1903, as cited in Joslin, 1903) recommended the 
use of pedometers in the treatment of obese patients. In a recent meta-analysis of 26 stud- 
ies (Bravata et al., 2007), pedometer use resulted in an average increase of 1 mile in daily 
walking and led to positive health effects, such as a decrease in systolic blood pressure. 
However, weight reduction did not seem to be associated with an increase in walking due 
to pedometer use. 


Assessing Body Movements during Activity, Rest, and Sleep 


To investigate rest and activity patterns in different species, including humans, Viennese 
biologist Szymanski (1918) designed different actographs. Based on the principle of a 
balance, his actographs recorded movements of an animal in a cage, or movements of a 
human in a bed or on a seat. In a series of experiments, Szymanski demonstrated that
activity patterns follow a 24-hour cycle even in the absence of external stimuli such as 
light or temperature, providing evidence for an endogenous biological clock. In contrast 
to human adults, who usually have an activity period during the day and a rest period 
during the night, Szymanski (1922) observed a polyphasic cycle in newborn infants, with 
five to six activity periods relatively equally distributed over 24 hours. Simplified acto- 
graphs, which could be attached to a bed, were developed by sleep researchers. With such 
a hypnometer, Karger (1925) recorded body movements to investigate normal and dis- 
turbed sleep behavior in children. He observed large variations in children’s movements 
during sleep and found that the extent of activity was not an indicator of sleep depth. 

A few years later, the EEG was introduced in sleep research, and polysomnography 
(recording of EEG, electrooculography [EOG], EMG, and ECG) assessed in a sleep labo- 
ratory has become the diagnostic “gold standard.” Nevertheless, monitoring of physical 
activity still plays an important role in sleep assessment. Early validation studies in the 
1970s demonstrated high agreement between arm movements measured by electronic 
wrist-worn movement sensors (actigraphs, see later) and stages of sleep and wakefulness 
determined by EEG (Sadeh, Hauri, Kripke, & Lavie, 1995). Since the 1980s, algorithms 
for automatic distinction between sleep and wakefulness from actigraph data have been 
developed. Actigraphs enable an easy assessment of sleep behavior in the natural envi- 
ronment over long time periods, with almost no burden for participants; as a result, 
they have been used in hundreds of sleep studies (Ancoli-Israel et al., 2003; Sadeh et al., 
1995). 


Assessing Movements of Body Parts 


Schulman and Reisman (1959, as cited in Johnson, 1971) modified an automatically 
winding wristwatch to record acceleration due to movement of the body part to which 
it was attached (usually a wrist or ankle). Schulman and colleagues used this modified 
watch—called an actometer—to investigate physical activity in children with hyperactiv- 
ity and retardation, and to explore effects of sedative drugs (see Johnson, 1971; Tryon, 
1991). MacCoby, Dowley, and Hagen (1965) investigated whether there is a relationship 
between intelligence and activity, measured with an actometer. They found no correlation 
between activity in nursery school children’s free play and their IQ. However, during 
tasks requiring inhibition of movements, children with higher IQs moved less, indicating 
that ability to inhibit activity becomes crucial for cognitive performance. 

In the 1970s, technology progressed rapidly. Actigraphs (watch-size electronic tilt 
counters, with piezoelectric sensors) and accelerometers (measuring acceleration and 
sensitive to frequency and intensity of movements) were developed (Montoye & Taylor, 
1984; Tryon, 1991). Size, precision, and storage capacity of the devices have since been 
steadily improved, and the number of studies conducted with such devices has rapidly 
increased in sleep and other clinical and health research areas. 

Changes in physical activity and movement patterns are diagnostic criteria for many 
mental disorders. Therefore, motor behavior in psychiatric patients’ daily lives has been 
investigated extensively (Stanley, 2003; Teicher, 1995). In addition to pedometers, acti- 
graphs and accelerometers have been used to estimate energy expenditure, which is the 
most crucial control variable in psychophysiological monitoring studies and also is rel- 
evant in research on health behavior. Reviewing different methods to assess physical 
activity in epidemiological research, Laporte, Montoye, and Caspersen (1985) mentioned 
high accuracy, low reactivity, and low burden for participating subjects as advantages 
of electronic movement monitors. However, they argued that large population studies 
would remain unlikely until prices of the devices decreased. Since then, such population studies 
have been conducted. For example, Riddoch and colleagues (2007) monitored physical 
activity of 5,595 British children for 7 consecutive days with accelerometers worn at the 
hip. Alarmingly, they found that only 5.1% of the boys and 0.4% of the girls achieved the 
recommended level of physical activity. 


Assessing Coordination of Movements and Identifying Different Activities 


More than 130 years ago, French physician and inventor Etienne Jules Marey (1878) 
described devices and procedures to assess synchronization of movements in humans 
and animals during gait. Monitoring of complex movement patterns in humans under 
daily life conditions became possible in the 1990s with the development of multichannel 
accelerometry. Signals of various sensors attached to different body parts are recorded 
simultaneously. With a configuration of three piezoresistive sensors, one attached to the 
sternum and two at the thighs, Foerster (2001) and colleagues could automatically dis- 
criminate the current body position (standing, sitting, lying) and motor activity (e.g., 
walking, running, riding a bike). Multichannel accelerometry now plays an important 
role in rehabilitation medicine (Bussmann, 1998; Bussmann et al., 2009; Bussmann & 
Ebner-Priemer, Chapter 13, this volume). 


Summary and Conclusion 


In this chapter, we have highlighted important steps in the history of three major 
approaches to assess experiences and behavior in people’s daily lives: Diaries and self- 
record methods, psychophysiological monitoring, and monitoring of body movements. 
In reviewing this history, we were surprised to recognize that the roots of research con- 
ducted in daily life reach well back to the 19th century. Already by the time Wundt had 
founded the first psychological laboratory, Preyer (1882/1889) outlined systematic rules 
to enhance the trustworthiness of observation records of infants’ spontaneous behavior. 
In addition, Hodges (1882) used pedometers to obtain objective measures of distances 
walked at the workplace. Münsterberg (1892) proposed three dimensions of affect 
experienced in daily life, based on a long-term self-record study, to which he added 
psychological tests performed under field conditions. These pioneers conducted case studies. 
However, since the 1920s, impressive group studies have been conducted in which partic- 
ipants record their daily or even momentary experiences and events on structured scales. 
Even at this early stage, the motivation for researchers to gather data in a person’s daily 
life was the same as it is today: avoiding distortions due to memory biases and enhancing 
the ecological validity (ecological generalizability) of results. 

By tracing these methods back to the 19th century, we add to other reviews that 
located their beginnings more recently in the 20th century. For example, reviewing the 
origin of self-record methods, Wheeler and Reis (1991) acknowledged time budget and 
work sampling studies—mainly conducted after World War II—as precursors of daily 
self-record studies. More recent reviews with a historical perspective date the oldest self- 
record studies back to the 1920s (e.g., Stone et al., 2007). Wheeler and Reis also noted 
that, owing to behaviorism, there was a moratorium on the study of inner 
experience between the 1920s and 1960s. However, we found such studies conducted in 
the 1920s and 1930s, so we could not confirm their notion. On the other hand, Fahren- 
berg (2001) associated the origin of ambulatory assessment (characterized by the use of 
mobile electronic or mechanical devices to measure experiences, physiological reactions, 
or behavior in people’s daily lives) with Holter’s invention of portable ECG-monitoring 
recorders in the 1950s. Today, we would see Hodges’s (1882) use of pedometers to mea- 
sure distances walked at the workplace as the beginning of ambulatory assessment. His- 
torical reviews in the future will probably find even older studies, with an increase in the 
number of psychological and medical journals that allow electronic screening of early 
articles. 

Although researchers have long been aware of the advantages of an objective assess- 
ment of physiological reactions and daily movement behavior, such research has mostly 
been conducted in a medical or biological context. Despite the great potential of both 
approaches for psychological research, their applications in psychology have been rather 
the exception (see, e.g., Fahrenberg & Myrtek, 1996, 2001; Tryon, 1991). However, as 
shown in recent reviews and this handbook, the situation is changing, and more psychol- 
ogists are beginning to examine objective features of behavior and physiology in addi- 
tion to subjective experiences in people’s daily lives. This development has been fostered 
through scientific networks that have recently merged into the Society for Ambulatory 
Assessment (www.ambulatory-assessment.org). 

In order to fully appreciate the contributions of early researchers, it is important to 
recognize that they had to overcome many obstacles that made their studies much more 
cumbersome and laborious than today. Especially in diary studies, researchers had to 
manage and analyze the multitude of information they gathered, without the help of 
computers. Results were usually presented descriptively. However, some pioneers already 
calculated statistical tests (e.g., Flügel, 1925; Kambouropoulou, 1926; Thomson, 1930). 
Since the 1980s, powerful statistical methods, such as multilevel modeling (Goldstein, 
1995; Raudenbush & Bryk, 2002), have been developed that are able to cope with the 
peculiarities of data gathered in daily life (hierarchical structure, unequal number of 
observations, missing data, autocorrelation). Multilevel models have been increasingly 
applied since the 1990s and are now the standard method for analyzing such data 
(e.g., Bolger et al., 2003; Reis & Gable, 2000; for recent developments, see Part III, this 
volume, on data-analytical methods). The availability of handheld computers for data 
gathering and the improvement of statistical methods have contributed to the rapidly 
increasing number of diary studies during the last two decades. 

As this handbook shows, conducting research in daily life has become more sophis- 
ticated but also easier than ever before. Researchers today have available a great variety 
of recording technologies, computer programs, and statistical tools to investigate many 
different facets of people’s experience and behavior in daily life—advances about which 
pioneers in the field could only dream. 
PART II 


STUDY DESIGN CONSIDERATIONS 
AND METHODS 
OF DATA COLLECTION 


CHAPTER 5 


Getting Started 
Launching a Study in Daily Life 


TAMLIN S. CONNER 
BARBARA J. LEHMAN 


Capturing naturalistic human experience is a fundamental challenge for researchers 
in psychology and related fields. Fortunately, as demonstrated by the broad scope 
of this handbook, there is an increasing number of methodologies for studying the full 
range of experiences, behavior, and physiology as people go about their daily lives. Our 
goal in this chapter is to provide a starting point for researchers interested in using these 
methods. Our aim is to provide practical guidance and basic considerations in how to 
design and conduct a study of individuals over time in their naturalistic environments. 
We explain how basic design considerations depend on a number of factors—the type of 
research question, the characteristics of the sample of interest, the nature of the phenom- 
ena under investigation, and the resources available to conduct the research. We address 
each consideration in turn and integrate our practical discussions with examples that 
draw on different types of research questions. In this way, our chapter lays a foundation 
for the more advanced chapters that follow in this section of the Handbook. 


Preliminary Considerations 
in Designing a Study to Capture Daily Life 


Figure 5.1 outlines the important steps in designing a study of everyday life. The first 
steps are to determine the research question, the target variables (experience, behavior, 
and/or physiology), and the population of interest (university students, children, married 
couples, cancer patients, etc.). These decisions, in turn, influence the type of method, 
sampling strategy, technology platform, and conduct of the study. 
The Research Question and Target Variables 


Every study starts with a well-defined research question that determines whether inten- 
sive longitudinal methods are appropriate. These methods are geared toward investigat- 
ing micro-level processes at the daily level—that is, the content of and patterns surround- 
ing experiences, behaviors, or physiology as they unfold in real time, or close to real time, 
in daily life. Accordingly, these methods are best suited to questions aimed at the micro- 
level, such as “How does mood vary across the course of a week?”, “Which individuals 
cope more effectively with daily stressors?”, or “How do cardiovascular responses dif- 
fer between the workplace and home environments?” Questions about macro-level pro- 
cesses, such as the relation between lifetime events and lifetime outcomes (e.g., childhood 
abuse leading to adult depression), are better suited to traditional longitudinal methods 
in which assessments are taken across months, years, or decades rather than moments 
or days. Of course, macro-level variables can be examined as predictors of daily micro- 
level processes (e.g., “Do child abuse victims react differently to daily stressors than 
nonvictims?”)—but ultimately the research question must contain a focus on micro-level 
daily processes as predictors or outcomes. 

Intensive longitudinal methods are also better suited to answering descriptive or cor- 
relational questions rather than causal questions. Experimental control is lacking in natu- 
ralistic environments: There is no random assignment to different daily circumstances; 
and everyday events are not manipulated, at least not by experimenters. As a result, these 
methods are far better suited for understanding the social or emotional circumstances in 
which a phenomenon occurs, or the conditional effects of environmental events on psy- 
chological processes. Terms such as predict and associated are favored over terms such 
as cause and effect. Some elements of causality can be tested by using lagged analyses, in 
which events at an objectively earlier time period (e.g., stress during the day) are used to 
predict later experiences (e.g., drinking that night); however, lagged analyses only indi- 
cate precedence, a necessary but not sufficient condition of causality. 
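
A lagged predictor of the kind described above can be built by pairing, within each person, the report from one occasion with the outcome from the next. The following is a minimal sketch only; the record layout (`person`, `occasion`, `stress`, `drinks`) and the function name are hypothetical and are used purely for illustration.

```python
from collections import defaultdict


def lagged_pairs(records, predictor, outcome):
    """Pair each person's predictor at occasion t-1 with their outcome at t."""
    by_person = defaultdict(list)
    for rec in records:
        by_person[rec["person"]].append(rec)

    pairs = []
    for person, recs in by_person.items():
        recs.sort(key=lambda r: r["occasion"])
        for earlier, later in zip(recs, recs[1:]):
            # Pair only consecutive occasions, so that a missed report
            # does not silently stretch the lag.
            if later["occasion"] - earlier["occasion"] == 1:
                pairs.append((person, earlier[predictor], later[outcome]))
    return pairs


# Hypothetical example data: person B missed occasion 2.
records = [
    {"person": "A", "occasion": 1, "stress": 2, "drinks": 0},
    {"person": "A", "occasion": 2, "stress": 5, "drinks": 1},
    {"person": "A", "occasion": 3, "stress": 1, "drinks": 3},
    {"person": "B", "occasion": 1, "stress": 4, "drinks": 0},
    {"person": "B", "occasion": 3, "stress": 3, "drinks": 2},
]
print(lagged_pairs(records, "stress", "drinks"))
```

Because lags are formed within persons and only across adjacent occasions, the resulting pairs establish temporal precedence and nothing more; they would still need to be analyzed with a model that respects the nesting of occasions within persons.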

The decision to use micro-longitudinal methods also requires acknowledging another 
caveat: dispelling the myth that momentary assessments of experience and behavior are 
the “gold standard” of subjective self-report methods. Historically, as pager-based expe- 
rience sampling methods were introduced and later invigorated with mobile computers, 
real-time data-capture methods developed a reputation as the “gold standard” of self- 
report procedures. The perspective that these methods are inherently better than other 
types of self-reports (retrospective, global/trait) is outdated. As noted in Chapters 1 and 
2 by Reis and Schwarz (respectively, this volume), experiential reports and retrospective 
reports provide different types of information—one is not necessarily better than the 
other. A growing body of research suggests that how people filter and remember their 
experiences, not necessarily what actually happened across various moments, can be the 
stronger predictor of some future behavior (e.g., Wirtz, Kruger, Napa Scollon, & Diener, 
2003). The key determinant is whether the research question concerns the “experiencing 
self” or the “remembering self” (Kahneman & Riis, 2005; see also Conner & Barrett, 
2012). 

Clarifying the research question, in turn, drives decisions about what variables need 
to be measured—experiences, behaviors, physiology, or a combination of these. Table 5.1 
shows a proposed taxonomy of the methods best suited to capturing these variables. One 
key distinction is between active versus passive methods of data collection. Active simply 
means that the participant is involved in providing the measurement through either self- 
report (as with daily diaries or experience sampling methods) or some other voluntary 
action, such as giving a saliva sample (as with ambulatory neuroendocrinology). Pas- 
sive means that data are collected using devices without any direct involvement of the 
participant except wearing the device. The choice between active and passive methods 
is particularly important when measuring behaviors such as physical activity. As noted 
by Bussmann and Ebner-Priemer (Chapter 13, this volume), the correlation between self- 
reported physical activity and activity measured passively through actigraphs is moder- 
ate at best, suggesting that these methods capture somewhat different although overlap- 
ping phenomena (subjective perceptions of exercise vs. objective bodily movements). This 
active-passive distinction is also relevant to measuring experiences such as emotions. 
From a strict phenomenological perspective, momentary emotional experience can be 
assessed only using real-time self-report methods such as experience sampling, daily dia- 
ries, and event-contingent sampling; however, emotions can also be observed and coded 
from verbal behavior captured with ambulatory acoustic monitoring (see Mehl & Rob- 
bins, Chapter 10, this volume). As Mehl and Robbins explain, these two approaches 
reflect different perspectives: one from the viewpoint of the self, the other from the view- 
point of the observer. The key to designing a successful study is to choose the method 
that maps onto the intended construct—whether through the eyes of the actor or inferred 
through other means. 


TABLE 5.1. A Taxonomy of Intensive Longitudinal Methods 

Experience: Phenomenology such as mood, pain, fatigue, cognitions, perceptions, and appraisals 
    Active: Experiences are self-reported by participants (e.g., mood ratings, pain ratings, stress ratings). 
        Methods: Experience sampling, daily diaries, event sampling 
    Passive: Experience is inferred through observation (e.g., unobtrusive auditory sampling with the electronically activated recorder [EAR]). 
        Methods: Acoustic sampling 

Behavior: Actions that are observable to others such as drinking, smoking, exercise, talking, eating, interactions, and location 
    Active: Behaviors are self-reported by participants (e.g., self-reported alcohol use, food/exercise diary, event-recording). 
        Methods: Experience sampling, daily diaries, event sampling 
    Passive: Behaviors are measured with no intervention or reporting necessary (e.g., pedometer or actigraph to infer physical activity, unobtrusive auditory sampling with the EAR, GPS to measure location). 
        Methods: Activity sampling, acoustic sampling, passive telemetrics, context sampling 

Physiology: Internal workings of the body and brain such as temperature, breathing, heart rate, blood pressure, and hormone levels 
    Active: Physiology is measured by participants (e.g., participant takes saliva samples which are assayed for cortisol by experimenters). 
        Methods: Neuroendocrine sampling, physiological sampling 
    Passive: Physiology is measured with no intervention by participant (e.g., passive sampling of heart rate and blood pressure, continuous glucose monitoring, temperature tracking, measurement of breathing). 
        Methods: Neuroendocrine sampling, physiological sampling 

Characteristics of Research Participants 


Choices regarding design, research protocol, and the platform for data collection are 
influenced by the characteristics of the participants. Research approaches that are pos- 
sible with university students may not be feasible or wise in other samples. Important 
sample considerations include whether or not all potential participants have consistent 
Internet access, sufficient comfort with technology, and strong verbal skills, as well as 
whether they can be trusted to follow protocols and care for equipment. The cost of 
equipment, consequences of equipment loss or damage, and available incentives to aid in 
equipment care and return should also inform design choices. For example, an Internet- 
based approach that makes use of auditory, pictorial, or visual stimuli might be par- 
ticularly appropriate for children who have consistent Internet access during the times 
of interest. Adolescents or younger adults might find it most convenient and rewarding 
to use their own mobile phones for text messaging (e.g., Conner & Reid, 2012). Those 
who could be trusted with equipment could also use handheld devices, such as a special- 
ized mobile phone or personal digital assistant (PDA). Depending on the topic, such a 
study might effectively frame responses as Twitter “tweets” or Facebook status updates. 
However, these same approaches would not be as appealing or appropriate for a study 
with older adults, who might find it difficult to read fine print on a PDA, or who may 
lack Internet access or not be accustomed to technology. The human factors involved 
with using PDAs can be especially problematic for some older individuals. The screens 
are small and the audible prompts are often at high frequencies, which can make them 
difficult to hear. For an older adult sample, interactive voice response (IVR) methods 
through the telephone (see Mundt, Perrine, Searles, & Walter, 1995) or pencil-and-paper 
approaches, possibly combined with handheld, automated time-date stamps to confirm 
compliance, may be a more appropriate choice. As younger generations age, access to and 
comfort with technology will be less of an issue. 

Concern for equipment loss would be particularly pronounced if expensive equipment 
such as PDAs, cell phones, or digital cameras were being used by participants with lim- 
ited incentive to return equipment. However, Freedman, Lester, McNamara, Milby, and 
Schumacher (2006) described success in providing inexpensive cell phones to homeless 
individuals undergoing outpatient treatment for cocaine addiction. Participants reported 
on cravings and substance use through an IVR initiated each time they responded to a 
randomly initiated cell phone call. Participants were paid in incremental amounts, total- 
ing up to US $188. Only 1 of 30 participants failed to return the cell phone, and partici- 
pants generally cared for the equipment and followed research protocols. 

Likewise, although participant burden should always be minimized as much as 
possible, a simple, relatively unobtrusive approach would be especially important for 
highly stressed participants who have limited time and resources. For example, a 1-week 
daily diary study is relatively low in terms of participant burden. For stressed but highly 
responsible participants, passive acoustic monitoring (see Mehl & Robbins, Chapter 10, 
this volume) may be especially effective. 


Resources 


Budget and the desired speed of data collection also influence the data collection plat- 
form. Platforms vary considerably in cost, with the most expensive options drawing on 
ambulatory monitors of cardiovascular or respiratory functioning (e.g., the LifeShirt; 
see F. Wilhelm, Grossman, & Müller, Chapter 12, this volume) and the least expensive 
utilizing pencil-and-paper responses. Appendix 5.1 gives an estimate of the relative costs 
of different approaches. It is best to consider projected costs at the start of a study; this 
allows one to choose the platform that accommodates his or her budget. 

Equipment costs can be the most expensive part of these studies, but they do not 
have to be. For many studies in everyday life, particularly for studies requiring the high- 
est level of control and flexibility, researchers must purchase smartphones, PDAs, tablet 
computers, or other devices before the study is under way (see Kubiak & Krog, Chapter 7, 
this volume, for device options). Although utilizing a simple, once-daily paper-and-pencil 
diary instead of a computerized approach costs less, this choice considerably reduces 
researcher control and is feasible for only some types of research questions. Good low- 
cost computerized alternatives include using a daily survey that people access through the 
Internet, substituting a person’s own cell phone as a signaling device instead of a pager, or 
using a person’s own cell phone to conduct experience sampling through text messaging 
(e.g., Kuntsche & Robert, 2009) or IVR (e.g., Courvoisier, Eid, Lischetzke, & Schreiber, 
2010). There are also low-cost ambulatory devices such as pedometers for measuring 
physical activity. If greater control is needed, or if more expensive equipment is required, 
it is possible to minimize costs by collecting data in waves on a smaller number of partici- 
pants (e.g., purchasing 10 devices and collecting data in several waves to achieve a larger 
sample size). It is also possible for researchers to share resources and equipment, and 
therefore embed data for several testable hypotheses within a single study. Last, although 
it adds to the overall costs, it is useful to purchase some replacement equipment at the 
start of a study. For example, purchasing an extra 20% (e.g., two extra devices for a set 
of 10) extends the lifetime of the fleet because units can be replaced individually as they 
become lost or damaged. 
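
The arithmetic behind collecting data in waves with a small fleet plus spares can be sketched as follows. The target sample size of 60 is an assumption made for illustration; the fleet of 10 devices and the 20% spare margin are taken from the example above, and the function name is hypothetical.

```python
import math


def plan_fleet(target_n, devices_per_wave, spare_percent=20):
    """Estimate the number of waves and spare units for a device-based study."""
    # Each wave can run at most one participant per device.
    waves = math.ceil(target_n / devices_per_wave)
    # Round spares up so that a small fleet still gets at least one spare.
    spares = math.ceil(devices_per_wave * spare_percent / 100)
    return {
        "waves": waves,
        "spares": spares,
        "devices_to_buy": devices_per_wave + spares,
    }


# 60 participants (assumed), 10 devices in the field at a time, 20% spares.
print(plan_fleet(60, 10))
```

With these inputs the plan calls for six waves and two spare units, that is, 12 devices purchased in total. The sketch ignores attrition and the turnaround time between waves, both of which would lengthen a real schedule.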

Of course, costs arise not only from equipment expenses but also from payments 
to research assistants and research participants. Because research in everyday life often 
involves considerable contact with participants and produces large quantities of data, 
the research is often conducted by a relatively large team of well-trained research- 
ers. Likewise, participation often requires considerable time and attention to detail. 
It is therefore useful to compensate participants well and offer appropriate incentives. 
Researchers vary in how they compensate participants. Sometimes researchers tie pay- 
ment to response rates (e.g., paying for every report made), which is a good strategy 
for daily diaries but may not work for more intensive sampling that involves multiple 
reports per day. In these cases, researchers can set a minimum criterion to be met for 
payment (e.g., completing a minimum of 50-75% of all reports), or offer payment or 
raffle entry based on the percentage of reports made. Remuneration can be supplied in 
forms other than cash payment, such as university course credit, raffles, gift certificates, 
and other incentives. Remuneration is a somewhat controversial issue. Researchers 
should consult with their own ethics boards to determine what remuneration formats 
provide the appropriate balance of compensation and incentive, without being overly 
coercive. Offering appropriate incentives throughout the procedure, although not so 
much that it would be unnecessarily coercive, can proactively reduce attrition and help 
to protect equipment. 
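
A minimum-response criterion of the kind described above reduces to a simple proportion. The sketch below assumes a hypothetical protocol of six signals per day for 7 days (42 scheduled reports) and a participant who completed 30 of them; the function names are illustrative only.

```python
def compliance_rate(completed, scheduled):
    """Proportion of scheduled reports that the participant completed."""
    if scheduled <= 0:
        raise ValueError("scheduled must be a positive number of reports")
    return completed / scheduled


def meets_criterion(completed, scheduled, minimum=0.50):
    """True if the participant reached the minimum proportion for payment."""
    return compliance_rate(completed, scheduled) >= minimum


scheduled = 6 * 7   # assumed protocol: six signals per day for one week
completed = 30      # assumed number of reports actually made
print(round(compliance_rate(completed, scheduled), 2))
print(meets_criterion(completed, scheduled, minimum=0.50))
print(meets_criterion(completed, scheduled, minimum=0.75))
```

In this example the participant completed about 71% of reports, which satisfies a 50% criterion but not a 75% criterion. Where the threshold is set within that range is a design decision that interacts with the ethics considerations noted above.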
Sampling Strategy and Technology Platform 


An essential step in designing a study is to choose the frequency and timing of observa- 
tions (also called the sampling strategy). Appendix 5.1 presents four main types of sam- 
pling strategies, a summary of when to use them, and the technology platforms available 
at the time of publication for each approach. As shown in column 2, sampling strategies 
depend on the research question and the characteristics of the phenomenon of interest— 
the type of experience, behavior, or physiology that one wishes to sample. Expected 
frequency of the phenomenon is the key to deciding the type of sampling strategy. If the 
occurrence of the phenomenon is relatively rare, it is unlikely that randomly sampled 
moments throughout the day would effectively capture the behavior. A better strategy 
for such an occurrence would be for participants to record decisions after the event hap- 
pens (using event-based sampling). This may require a longer period of time (weeks to 
months) to allow for sufficient recording of events. In contrast, if the behavior of inter- 
est is ongoing (e.g., mood or physiology), then experience sampling throughout the day 
over a shorter time span (days to weeks) or continuous sampling will work better. It is 
important for researchers and reviewers to note that most studies of everyday life are 
designed to sample observations randomly, with the intent of generalizing to a popula- 
tion of occasions. Although it was controversial at the time, Brunswik (1955) similarly 
argued that to capture psychological phenomena effectively, researchers must necessarily 
sample observations from environmental contexts, just as they sample from populations 
of participants. It is therefore important to sample enough occasions or time periods that 
the selected observations can generalize to the population of experiences. 

As shown in Appendix 5.1, the first two sampling strategies are “time-based” pro- 
tocols (Bolger, Davis, & Rafaeli, 2003), in which timing of assessments occurs at either 
variable or fixed times. For variable sampling, also called signal-contingent sampling 
(Wheeler & Reis, 1991), assessments are made in response to a signal delivered at unpre- 
dictable times, typically between four and 10 times each day. The signals are usually dis- 
tributed throughout the day “randomly within equal intervals.” For example, for a study 
sampling eight times a day between 9:00 A.M. and 9:00 P.M., the first signal would come 
randomly between 9:00 A.M. and 10:30 A.M., the second signal would come randomly 
between 10:30 A.M. and 12:00 P.M., and so on, because there are eight 90-minute inter- 
vals in those 12 hours. A minimum necessary time between signals (e.g., 30 minutes) can 
also be specified, so that signals are not too close together. By sampling across the day, 
this approach aims to generalize across a population of occasions during wakeful hours. 
At each signal, people may directly report their current experiences (e.g., mood, pain), or 
passively have an indirect assessment taken through ambulatory techniques (e.g., blood 
pressure readings). With variable sampling, the risk for participant burden is high. As 
such, observations each day should occur frequently enough to capture important fluc- 
tuations in experience, but not so frequently that they inconvenience participants without 
any incremental gain (Reis & Gable, 2000). There are no set rules for the number of 
sampled moments per day, but general guidelines have emerged. The usual range is 4-10 
signals per day, with about 6 being normative. For example, Delespaul (1992) advises 
against sampling more than six times per day over longer sampling periods (i.e., 3 weeks 
or more) unless the reports are especially short (i.e., 2 minutes or less) and additional 
incentives are provided. Because assessments are frequent and unpredictable, variable 
schedules are well suited for studying target behaviors that are ongoing and therefore 
likely to occur at a given signal (e.g., mood, pain, physiology, stress levels). They are also 
appropriate for studying subjective states that are susceptible to retrospective memory 
bias (e.g., emotions, pain, or any experience that is quick to decay), as well as states that 
people may attempt to regulate if reports are predictable. The main disadvantage of vari- 
able time-based sampling is the heightened burden to participants, who are interrupted 
by the signal. Fortunately, participants typically become accustomed to the procedure 
rather quickly (see the section on practical concerns in study design below). 
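
The "randomly within equal intervals" rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular signaling device: the function name `stratified_signal_times`, the seed, and the choice to redraw the whole day whenever two signals fall closer than the minimum gap are assumptions made for the example.

```python
import random

def stratified_signal_times(start_min, end_min, n_signals, min_gap, rng):
    """Draw one signal per equal-width interval, enforcing a minimum gap.

    Times are minutes since midnight. Redrawing the whole day when any
    gap is too short is a simplification chosen for this sketch.
    """
    width = (end_min - start_min) / n_signals
    while True:
        times = [
            start_min + i * width + rng.uniform(0, width)
            for i in range(n_signals)
        ]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if all(gap >= min_gap for gap in gaps):
            return times

def to_clock(minutes):
    """Format minutes since midnight as HH:MM (seconds are dropped)."""
    whole = int(minutes)
    return f"{whole // 60:02d}:{whole % 60:02d}"

# Eight signals between 9:00 A.M. and 9:00 P.M., at least 30 minutes apart.
rng = random.Random(2012)
schedule = stratified_signal_times(9 * 60, 21 * 60, 8, 30, rng)
print([to_clock(t) for t in schedule])
```

Each call yields one day's schedule; a study would typically draw a fresh schedule for every participant and day so that signal times stay unpredictable.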

For a fixed timing schedule, also called interval-contingent sampling (Wheeler & 
Reis, 1991), assessments are made at set times throughout the day (e.g., at morning, after- 
noon, and evening intervals or each night). Participants may be asked about their experi- 
ences or behaviors at that moment (momentary report) or during the time frame since the 
previous report (interval report). This latter format requires some retrieval or reconstruc- 
tion over a period of time and is best used for studying concrete events and behaviors 
that are less susceptible to memory bias (a checklist of daily events; reports of exercise, 
food intake, etc.). One of the most common types of fixed sampling protocols is the daily 
diary, in which experiences and behaviors are reported at the end of the day for a period 
of time (typically 1-4 weeks). This approach is covered in greater detail by Gunthert and 
Wenze, Chapter 8, this volume. Fixed schedules are also well suited to time series investi- 
gations of temporal phenomena. For example, Courvoisier and colleagues (2010) utilized 
a fixed timing schedule to assess six mood measures each day for 1 week, and evalu- 
ated within- and between-day fluctuations in mood and compliance with a mobile phone 
survey. The fixed nature of observations allows one to make generalizations about time 
(e.g., diurnal or weekly patterns in mood) by statistically comparing responses within 
and between individuals. The timing of the assessments must map onto the nature of 
the phenomenon, however. For cyclical temporal phenomena, such as diurnal variations 
in mood or cortisol responses, the observations need to be frequent enough to catch the 
trough and peak of the cycle. Otherwise, cycles will be missed or misidentified (an error 
known as aliasing in the time series literature). Fixed schedules are typically the least 
burdensome to participants. Reports are made at standardized times, and participants 
can configure their schedules around these reports. This regularity can be a drawback, 
however. If people make their own reports or initiate them in response to a signal at a 
fixed time, their reports will not reflect spontaneous contents in the stream of conscious- 
ness. Reports can also be susceptible to mental preparation and/or self-presentation. If 
these issues are a concern, then a variable schedule can be used, or the fixed schedule may 
be “relaxed” so that prompts are delivered less predictably around specified times. 
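
The aliasing problem mentioned above can be illustrated with a small simulation. The mood curve below is entirely hypothetical (a 24-hour cosine that peaks at 15:00), and the sampling steps are chosen only to show the contrast between an adequate fixed schedule and two inadequate ones.

```python
import math

def mood(hour):
    """Hypothetical diurnal curve: 24-hour cosine, peak at 15:00, trough at 03:00."""
    return 5 + 2 * math.cos(2 * math.pi * (hour - 15) / 24)

def sample(start_hour, step_hours, n):
    """Return n readings taken every step_hours, starting at start_hour."""
    return [mood(start_hour + i * step_hours) for i in range(n)]

def spread(values):
    """Difference between the highest and lowest reading."""
    return max(values) - min(values)

every_3h = sample(0, 3, 8)       # includes 03:00 and 15:00
once_daily = sample(21, 24, 7)   # same clock time each day
every_20h = sample(0, 20, 12)    # spacing is nearly as long as the cycle

print(round(spread(every_3h), 2))    # full peak-to-trough swing is captured
print(round(spread(once_daily), 2))  # the cycle is missed entirely
# With a 20-hour step the readings repeat every 6 samples (120 hours),
# which would look like a 5-day cycle rather than a daily one.
print([round(v, 2) for v in every_20h[:6]])
print([round(v, 2) for v in every_20h[6:]])
```

In this sketch the once-daily schedule shows no variation at all, and the 20-hour schedule suggests a slow cycle that does not exist in the underlying curve.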

A third type of sampling protocol is event-based sampling, in which assessments are 
made following a predefined event. This type of protocol, which is also called event- 
contingent sampling (Wheeler & Reis, 1991) and event-contingent recording (Moskowitz 
& Sadikaj, Chapter 9, this volume), is best used to investigate experiences and behaviors 
surrounding specific events, especially those that are rare and may not occur if one is 
sampled at fixed or intermittent times during the day (e.g., instances of conflict, lying, 
smoking, social events). Event-contingent sampling is typically initiated by the person 
soon after an event has occurred. In these applications, event-contingent sampling can 
be challenging to participants, especially if the events are very frequent or too broadly 
defined, so it is important to set clear and appropriately inclusive criteria for the target 
event. The unique challenges and advantages of event sampling are covered by Moskowitz 
and Sadikaj, Chapter 9, this volume. Recent technological advancements also make it 
possible to trigger event recording through other events, such as changes in environ- 
ment (see Intille, Chapter 15, this volume) or changes in physiological states, such as 
heightened arousal. Although this technology is still in development, Uen and colleagues 
(2003) described the advantages of using ambulatory electrocardiogram readings to trig- 
ger an ambulatory blood pressure reading to identify the occurrence of asymptomatic 
heart irregularities. Likewise, it is possible to use ambulatory cardiovascular readings 
to generate a signal on a PDA or similar device to probe social activities at the time of a 
physiological trigger (see F. Wilhelm et al., Chapter 12, this volume). 

A fourth type of sampling protocol is continuous sampling, in which assessments are 
made continually without any gaps. Continuous sampling methods are currently most 
frequently used to address physiological or medical research questions, but with tech- 
nological advances, researchers are starting to use these approaches in other domains. 
An advantage to continuous sampling strategies is that data are captured throughout the 
duration of the study; nothing is missed. For example, when used in combination with 
reports on the occurrence of stressful events, continuous electrocardiogram monitoring 
allows the researcher to examine not only cardiovascular reactivity to the event itself but 
also cardiovascular recovery following the event. When continuous sampling is used, it is 
possible to analyze the entire period of participation, or to sample random or event-based 
moments from the continuous data stream. In addition to ambulatory cardiovascular 
readings, researchers have continuously sampled physiological readings such as respira- 
tory function (Houtveen, Hamaker, & van Doornen, 2009), skin conductivity (Thurston, 
Blumenthal, Babyak, & Sherwood, 2005), and ambulatory glucose levels (Kubiak, Her- 
manns, Schreckling, Kulzer, & Haak, 2004). The sampling duration of these approaches 
is often short (1-3 days) and limited by the device’s battery life and storage space. 
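
As a rough illustration of sampling event-based moments from a continuous data stream, the sketch below computes reactivity and recovery around a reported stressful event. The heart rate values, the one-reading-per-minute format, and the 30-minute baseline and recovery windows are all hypothetical choices made for the example.

```python
from statistics import mean

def window_mean(stream, start, end):
    """Mean of readings with start <= minute < end; stream holds (minute, value) pairs."""
    return mean(value for minute, value in stream if start <= minute < end)

# Hypothetical heart rate stream in beats per minute, one reading per minute.
stream = [(t, 70) for t in range(0, 60)]      # before the event
stream += [(t, 90) for t in range(60, 70)]    # stressful event reported for minutes 60-69
stream += [(t, 78) for t in range(70, 100)]   # after the event

event_start, event_end = 60, 70
baseline = window_mean(stream, event_start - 30, event_start)
during = window_mean(stream, event_start, event_end)
after = window_mean(stream, event_end, event_end + 30)

reactivity = during - baseline    # rise above baseline during the event
recovery_gap = after - baseline   # elevation still present after the event
print(reactivity, recovery_gap)
```

The same stream could instead be analyzed in full or sampled at random moments; only the window boundaries would change.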

It is also possible to use more than one sampling strategy within a single study, or to 
use different sampling approaches to address the same general topic. For example, recent 
research has begun to draw on naturalistic approaches to examine the psychological 
effects of stigma management. Beals, Peplau, and Gable (2009) used a combination of an 
event-contingent and a fixed time-based approach to study the emotional consequences of 
opportunities for gay men and lesbians to disclose their sexual orientation. Participants 
in the 2-week study completed a report each time they experienced a situation in which 
they felt they could have disclosed their sexual orientation (event-contingent record) and 
also a nightly time-based report describing their social support, coping, and affect over 
the day (daily diary record). In a study examining a similar topic, Hatzenbuehler, Nolen- 
Hoeksema, and Dovidio (2009) used a daily diary approach that required participants to 
provide a nightly report in which they described both stigma-related and other stressors 
that had occurred during the day. Both African American and gay male and female par- 
ticipants reported greater emotional distress, more rumination, and more emotional sup- 
pression on days when they described stigma-related stressors. These two studies provide 
examples of how several different sampling strategies may be used effectively to study 
everyday experiences with stigma-related stress. 


Duration of Sampling Period and Statistical Power 


The duration of the sampling period is also a key factor in studies of daily life. Studies 
involving multiple reports per day (variable or fixed) typically run from 3 days to 3 weeks. 
Daily diary studies typically run from 1 to 4 weeks. Event-based studies depend on the 
frequency of the phenomenon, with the duration ranging from 1 week for frequent events 
(e.g., interpersonal interactions) to several months for less frequent events (e.g., risky 
sex). Continuous studies are typically conducted over short time periods (i.e., 1-3 days). 
Continuous studies lasting any longer than a week currently are technically difficult and 
overly burdensome to participants. 

Statistical power plays an important role in deciding the duration of the study and 
the required sample size. Fundamentally, the number of moments sampled must provide a 
reliable estimate of the phenomena for each person. We refer the reader to Bolger, Stadler, 
and Laurenceau, Chapter 16, this volume, for a more detailed description of statistical 
power and how it affects the number of observations required per person and sample size. 
As with most analyses of statistical power, the total number of observations needed per 
person and the sample size recommendations vary considerably with the complexity of 
the research hypotheses. For example, researchers who are interested primarily in asso- 
ciations among phenomena being sampled from daily experiences—and not in the role 
of individual-difference factors in moderating those associations—may require relatively 
fewer research participants. 


Technology Platform 


Appendix 5.1 also lists the most common technology platforms currently available for con- 
ducting studies in daily life. The choice of technology platform reflects a tradeoff between 
cost, complexity, and control. Computerized methods cost more and are more challeng- 
ing to implement, but they provide the greatest control over the timing elements (i.e., by 
controlling when reports are made and time-date stamping of each report). Platforms for 
computerized experience sampling include PDAs or smartphones (mobile phones outfit- 
ted with specialized configurable software that enables participants to answer questions 
about their experiences). For a discussion of device and software options, see Kubiak and 
Krog, Chapter 7, this volume. Computerized devices can be used with any of the sam- 
pling strategies, but they are especially beneficial for time-based protocols because com- 
pliance with timing can be validated. However, the use of automatically recorded times 
does not ensure compliance with event-based procedures in which participants initiate 
their own reporting. Although a self-initiated report is time-date stamped, researchers 
cannot know objectively how much time has passed since the event. Delays between the 
occurrence of the event and reporting on it have the potential to introduce memory bias. 
Computerized devices have several other advantages over noncomputerized approaches, 
including the capacity to randomize how questions are presented (to reduce response sets 
and order effects) and to record useful ancillary information, such as response latencies 
to each question. The disadvantages start with the upfront financial investment. The cost 
of units can vary considerably. There is also the potential cost for the software (which 
ranges from free to very costly; see Kubiak & Krog, Chapter 7, this volume). Computer- 
ized platforms also often impose limits on the question format. Open-ended or “free” 
responses are not easily incorporated, except perhaps for longer Internet-based daily dia- 
ries. Finally, a computerized platform may not be suitable for all subject populations (see 
the earlier section on characteristics of research participants). 

There are several lower-cost alternatives to computerized sampling with a fleet of 
PDAs or smartphones. As previously mentioned, experience sampling that uses 
participants’ own mobile phones through texting or IVR systems is becoming increasingly 
popular. With texting, researchers send texts to participants’ personal mobile phones using 
an in-house server or commercial SMS (short messaging service) company for a small 
cost per text (Conner & Reid, 2012; Kuntsche & Robert, 2009). Participants receive the 
texts and reply to a set number of questions (presented either in the text or in a separate 
key) using their numeric keypad. This approach is relatively affordable and can yield high 
compliance rates. For example, in a 2-week texting study of New Zealand university stu- 
dents (using a local SMS company, www.message-media.com.au), participants on aver- 
age replied to 95% of six texts per day (SD = 11%; range = 9-100%) (Conner & Reid, 
2012). Only four participants failed to meet minimum texting criteria (set fairly high at 
75% of texts). Removing those cases from analyses, the final compliance rate was quite 
good (M = 96%, SD = 5%, range = 77-100%). Another option is to use an IVR system 
in which automated specialized software such as SmartQ from TeleSage is used to call 
participants on their mobile phones to complete a survey (see Courvoisier et al., 2010). Of 
the two options, texting has somewhat less control over timing elements. Although a text 
may be sent to participants at a certain time, they may not receive it or reply to it imme- 
diately (if their phone is turned off or they have poor reception). This may not be too 
much of an issue; Conner and Reid (2011) found that the median time to reply to texts 
was only 2 minutes (M = 16 minutes; SD = 37 minutes; range = <1 minute-9 hours). Both 
texting and IVR approaches also typically have less space for questions than do PDAs or 
smartphones. If space is an issue, it is possible to use participants’ own mobile phones as 
low-cost pagers to remind them to complete a more extensive paper-and-pencil survey. 
Last, Internet-based surveys, another low-cost alternative, are especially well suited to a 
daily diary procedure. Surveys can be designed using relatively low-cost software (e.g., 
www.surveymonkey.com). Every day participants can log on to a secured website and 
complete their report, which is time-date stamped and stored for analysis. See Gunthert 
and Wenze (Chapter 8, this volume) for more detail on Internet daily diaries. 

Paper-and-pencil surveys are still a low-cost alternative to computerized approaches. 
In this approach, people carry around booklets or complete a survey each evening prior 
to going to bed. The advantages of paper-and-pencil methods include reduced cost, less 
overhead in terms of equipment, and the allowance of open-ended questions. The main 
disadvantage is the inability to confirm objectively that surveys were completed at the 
designated times. Although there is evidence that people may not complete paper-and- 
pencil reports according to the proper schedule (Stone, Shiffman, Schwartz, Broderick, & 
Hufford, 2002), there is also evidence that paper-and-pencil methods are valid and can be 
equally informative (Green, Rafaeli, Bolger, Shrout, & Reis, 2006). This “paper or plas- 
tic” debate has clarified several things. First, it is now acknowledged that computerized 
methods are better for circumstances necessitating precise timing control and assurance; 
however, extreme concerns with paper-and-pencil questionnaires may be overstated. 
Nonetheless, we expect that researchers in the future may find it increasingly difficult to 
publish research papers that use paper-and-pencil questionnaires if a suitable computer- 
ized method is available (and appropriate for that population). 

Several different approaches can help promote compliance with paper-and-pencil 
questionnaires. A simple and often effective approach is to establish a good working 
relationship with participants and to collect surveys frequently (see Green et al., 2006). 
It is also possible for even the most low-tech of data collection platforms to include an 
electronic method for ensuring compliance. For example, in their study of everyday stress 
among adolescents, Fuligni and colleagues (2009) asked participants to complete nightly 
checklists each day for 2 weeks. The researchers ensured compliance by providing par- 
ticipants with an electronic stamp (DateMark by Dymo Corporation, Stamford, CT) 
to place across the seal of an envelope holding that day’s checklist. This approach is far 
more convenient, and would likely yield greater compliance than requiring participants to 
place daily surveys in the mail. However, until the costs of electronic stamps are reduced 
(cost was US $65 at publication), this might not be cost-effective. Another similar option 
uses time and date stamps recorded by the Medication Event Monitoring System (MEMS 
SmartCap; Aardex Group, Union City, CA), a small container that can hold pills or sup- 
plies necessary for research participation (e.g., materials to provide saliva samples). For 
example, in a 2-week study, Applebaum and colleagues (2009) examined discrepancies 
between MEMS pill bottle openings and HIV-infected participants’ self-reported adher- 
ence with antiretroviral medications; participants reported greater medication adherence 
than was documented by the MEMS caps. 


Practical Concerns 
in Designing a Study to Capture Daily Life 


Participant Issues 


Once participants are recruited, it is important to keep them motivated throughout the 
study. Strategies for maintaining motivation include a complex remuneration 
structure with incentives such as money, research credits, and lotteries. Positive 
attitudes from research assistants are also crucial. In the authors’ and others’ experience 
(see Green et al., 2006), data quality appears to be highest when participants feel 
they are valued, have a sense of responsibility to the research assistant, and believe that 
the research itself is important. Participants should also clearly understand the study pro- 
cedures. They should know how and what they will be asked to report, when (roughly) 
they will report, and how to use the computerized device. With devices, it is best to have 
participants answer their first prompt in the laboratory, giving them the opportunity to 
ask questions and get comfortable with the technology. Also, during the study, partici- 
pants can be given feedback on their response rate to improve compliance. For especially 
complex studies, it may be useful to have a cell phone “hotline” that participants can call 
if they have questions or equipment difficulties. 


Data Preparation and Analysis 


Each type of method and platform presents its own unique challenges to data prepara- 
tion. More detailed guidelines for data cleaning are covered by McCabe, Mack, and 
Fleeson (Chapter 18, this volume). With paper-and-pencil methods, all data must be 
entered manually—a lengthy and error-prone process. Computerized approaches bypass 
this step because data are retrieved directly from the portable device, the Internet, or the 
text-messaging server; however, they still need to be cleaned and organized. For example, 
with text-messaging-based studies, records need to be checked for duplicates that occur 
when participants inadvertently send their text responses twice. Likewise, if time and 
date stamps are used to match participant self-reports with their physiological responses 
(e.g., cortisol samples or blood pressure readings), these cases must be matched up prior 
to analyses. Researchers should take care to ensure that equipment always has the correct 
time and date. It is also important to be aware of special date-time transitions, such as 
Daylight Saving Time and leap years. Moreover, in device-driven sampling, if reaction 
times are measured, trials with extremely fast reaction times typically indicate participant 
error, such as an inadvertent screen tap, and should be removed (see McCabe et al., Chap- 
ter 18, this volume, for recommendations). Unusually fast responding (<500 milliseconds, 
see McCabe et al., Chapter 18, this volume) can also sometimes indicate inauthentic 
responding, particularly if the person gives the same response for each item. Individuals 
with no variability in their responses should also be flagged and examined. Although 
there are some measures in which no variability is expected (e.g., a food diary in which a 
person reports eating no chocolate every day), for phenomena such as mood, stress, 
and other varying experiences, no variability may indicate a response set. These cases can 
be detected by computing for each individual a within-person standard deviation for each 
item across all days. Any participants removed for noncompliant responding should be 
noted in the write-up. 
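
Two of the cleaning steps described above, removing duplicate text replies and flagging participants with no within-person variability, might be sketched as follows. The record layout (participant, day, item, response) and the sample values are hypothetical; real text-message logs would usually carry timestamps and need a study-specific rule for what counts as a duplicate.

```python
from statistics import stdev

# Hypothetical records: (participant_id, day, item, response).
records = [
    ("p1", 1, "mood", 3),
    ("p1", 2, "mood", 5),
    ("p1", 2, "mood", 5),  # same reply sent twice
    ("p1", 3, "mood", 2),
    ("p2", 1, "mood", 4),
    ("p2", 2, "mood", 4),
    ("p2", 3, "mood", 4),
]

def drop_duplicates(rows):
    """Keep the first copy of each identical record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def within_person_sd(rows):
    """Return {(participant, item): SD of that person's responses across days}."""
    grouped = {}
    for pid, _day, item, value in rows:
        grouped.setdefault((pid, item), []).append(value)
    # A standard deviation needs at least two observations.
    return {key: stdev(vals) for key, vals in grouped.items() if len(vals) > 1}

clean = drop_duplicates(records)
sds = within_person_sd(clean)
flagged = sorted(pid for (pid, _item), sd in sds.items() if sd == 0)
print(len(clean), flagged)
```

A flag is only a prompt for inspection: as noted above, zero variability is expected for some measures, so flagged cases should be examined rather than removed automatically.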

After data have been checked and prepared, they are ready to be analyzed. The 
chapters in Part III of this handbook cover the common ways of analyzing intensive lon- 
gitudinal data. Before beginning analyses, decisions need to be made regarding treatment 
of missing data. We recommend reading the chapter on handling missing data by Black, 
Harel, and Matthews (Chapter 19, this volume). It is important to know why data may be 
missing and whether the patterns of missing data are random or reflect systematic bias. 
It is also common practice to delete individuals from analysis if they have excessive miss- 
ing data. Although minimum response rates are somewhat arbitrary and depend on the 
amount of data needed for analysis, requiring participants to complete at least 50% of 
reports seems commonplace. However, as noted by Black and colleagues, such casewide 
deletion may be too stringent an approach. 

At their core, all analytic methods in Part III recognize the “nested” nature of 
intensive longitudinal data, whereby observations are nested within people. This nested 
data structure requires proper analytic treatment to take into account shared variance 
between observations from the same individual. Multilevel modeling is one of the most 
common approaches to analyzing nested data (see Nezlek, Chapter 20, this volume). It 
allows researchers to model certain patterns within each individual and to test whether 
those patterns differ as a function of personal characteristics. It should be noted that 
observations taken closer together in time (10:00 A.M. and 12:00 P.M. on Monday) are 
typically more similar to each other than they are to other readings from a particular 
person (10:00 A.M. Monday vs. 7:00 P.M. Friday). This phenomenon of serial autocorrela- 
tion can be statistically modeled through a number of approaches, including entry of the 
preceding observation as a covariate in the multilevel model. For examples of studies that 
modeled serial autocorrelation, see Courvoisier and colleagues (2010) or Lehman and 
Conley (2010). Dyadic data analysis takes multilevel modeling one step further and mod- 
els shared variance between couples and other dyads (see Laurenceau & Bolger, Chap- 
ter 22, this volume). Other methods allow for more complex model building and draw 
on techniques such as structural equation modeling and multilevel mediational analyses 
(Eid, Courvoisier, & Lischetzke, Chapter 21, and Card, Chapter 26, this volume). Ana- 
lytic approaches can also be used to test for complex patterns of change and temporal 
dynamics (see Ebner-Priemer & Trull, Chapter 23, and Deboeck, Chapter 24, this volume) 
as well as the structure of daily experiences through within-person factor analyses 
(see Brose & Ram, Chapter 25, this volume). 
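
Entering the preceding observation as a covariate first requires building a lagged variable within each person. The sketch below shows one way to do this, under the assumption that each report carries a consecutive signal index; the function name `add_lag` and the sample data are hypothetical, and the multilevel model itself is not fitted here.

```python
def add_lag(rows):
    """Attach each person's immediately preceding observation.

    rows: (person, signal_index, value) tuples in any order.
    Returns (person, signal_index, value, previous_value) tuples, where
    previous_value is None if the preceding signal was missed or absent.
    """
    by_key = {(person, idx): value for person, idx, value in rows}
    return [
        (person, idx, value, by_key.get((person, idx - 1)))
        for person, idx, value in sorted(rows)
    ]

# Hypothetical mood reports; participant "a" missed signal 3.
rows = [
    ("a", 1, 3.0), ("a", 2, 4.0), ("a", 4, 2.0),
    ("b", 1, 5.0), ("b", 2, 5.5),
]
lagged = add_lag(rows)
for row in lagged:
    print(row)
```

Leaving the lag empty after a missed signal, rather than reaching back to an older report, is a design choice for this example; it keeps the lag from spanning unequal stretches of time, and never borrows a value from another participant.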


Reporting Guidelines 


When publishing daily life research, researchers are encouraged to include certain types 
of information in their write-up. Here, we highlight the most important guidelines from 
Stone and Shiffman (2002). 


1. Explain and justify the sampling strategy. Describe the type of sampling strategy; 
provide the frequency, timing, and length of the study; and briefly justify these design 
decisions. 


2. Explain the details of the data collection platform. Give a physical description of 
any computerized devices and software, including the website for software, if applicable. 
Explain how items were presented through the device, whether items were randomized, 
and whether there was any time limit for responding. 


3. Describe any participant training, monitoring, and compensation. Did partici- 
pants receive any training for the devices, receive feedback on response rates, or have 
other incentives for enhancing compliance? What other steps were taken to ensure com- 
pliance? How were participants compensated? 


4. Provide detailed compliance statistics. State the rationale for including or exclud- 
ing people from analyses; indicate how many people were excluded and why. Report 
descriptive statistics for the response rates (M, SD, min., max.) either in numbers or in 
percentages, as in “Participants responded on average to 65 out of 84 prompts, reflecting 
a response rate of 77% (SD = 10%, range = 50-98%).” Be clear whether these statistics 
included or excluded people eliminated for low or noncompliant responding. Timing 
statistics may also be appropriate for some studies. For example, text-messaging studies 
should report descriptive statistics for time delays between when the texts were sent and 
when the replies were received, if such information is available. 


5. Discuss any important data management steps taken beyond simple data cleaning 
procedures (eliminating duplicates, etc.). Examples include decisions resulting in reten- 
tion or dropping of some cases (e.g., dropping reports made outside designated time 
intervals). 
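
The compliance statistics called for in guideline 4 can be computed directly from counts of answered prompts. The sketch below is illustrative only: the counts are invented, the 84-prompt total assumes 6 prompts per day for 14 days, and the 50% retention threshold is one common but arbitrary choice.

```python
from statistics import mean, stdev

def compliance_summary(completed, n_prompts, min_rate=0.0):
    """Summarize response rates for participants at or above min_rate.

    completed: {participant: number of prompts answered}.
    Returns (retained rates, summary statistics).
    """
    rates = {p: count / n_prompts for p, count in completed.items()}
    kept = {p: rate for p, rate in rates.items() if rate >= min_rate}
    values = list(kept.values())
    summary = {
        "n": len(values),
        "mean": mean(values),
        "sd": stdev(values),
        "min": min(values),
        "max": max(values),
    }
    return kept, summary

# Hypothetical counts of answered prompts out of 84.
completed = {"p1": 80, "p2": 65, "p3": 42, "p4": 30, "p5": 71}
kept, summary = compliance_summary(completed, 84, min_rate=0.5)
excluded = sorted(set(completed) - set(kept))
print(excluded)
print({name: round(value, 2) for name, value in summary.items()})
```

Running the function once with `min_rate=0.0` and once with the retention threshold makes it easy to report both sets of statistics and to state clearly which one includes the excluded participants.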


Ethical Considerations 


In describing an in-depth interdisciplinary study of the daily life of dual-career fami- 
lies (Campos, Graesch, Repetti, Bradbury, & Ochs, 2009), an article in the New York 
Times (Carey, 2010, p. A1) described the study as “oddly voyeuristic,” noting that the 
study documented “every hug, every tantrum, every soul-draining search for a missing 
soccer cleat” experienced by the families who participated in the research. In truth, the 
approaches described in this handbook are somewhat unique in their portability and 
potential intrusion into the personal and psychological space of participants. For this 
reason they are sometimes met with skepticism from those who are initially learning the 
approach. In reality, participants typically habituate relatively rapidly to providing mul- 
tiple reports or measurements over the course of a day or week, and rapidly become less 
aware they are being observed. In our experience, participants often reported expect- 
ing the study to be more intrusive than it in fact was. Research participants also may 
be motivated by the opportunity for self-reflection and assessment that participation 
promotes. 

However, because studies of everyday life do often require participants to reflect on 
their own emotional, social, and other experiences throughout several days, several ethi- 
cal considerations become important. First, it is important to balance the requirements for 
sampling frequency and study duration against participant intrusiveness and burden. The 
study should include enough assessments to capture effectively the phenomenon of inter- 
est, but should always attend to minimizing overall participant burden. Pilot testing can 
help to ascertain the frequency of occurrence of the phenomenon of interest, and power 
analyses can help to determine the optimal sample size and frequency of assessment (see 
Bolger et al., Chapter 16, this volume). In addition, pilot testing should help to ensure that 
the assessments are as brief and clear as possible. It is also a good idea to question par- 
ticipants about the burden of participation and obtain suggestions for improvement after 
the study ends. Second, as with any study, participants should be reminded of their right 
to discontinue participation without any negative consequences. Third, it may be neces- 
sary to provide letters or gain other supplementary approvals to avoid potential negative 
consequences of research participation. For example, undergraduate student participants 
may benefit from letters to teachers or work supervisors explaining that participation in 
the experience sampling study will require the student to use a PDA or a cell phone to 
respond to a brief survey at random times throughout the day. This letter makes it clear 
that participants are not texting a friend but are participating in a research study. 

Participation in an in-depth experience sampling study may lead participants to 
reflect on their own emotions and social interactions. In fact, brief assessments of emo- 
tional states and thought processes are sometimes used as a clinical tool for promoting 
self-reflection and personal growth (Harmon, Nelson, & Hayes, 1980). Because of these 
factors, a careful, skilled debriefing session must be an important component of these 
studies. The session might usefully begin simply by acknowledging that it is possible (and 
likely) that participation in the study promotes an unusual level of self-focus, and may 
have touched on sensitive concerns or made participants feel as though their lives were 
under a magnifying glass. It is a good idea to provide community or campus resources for 
psychological consultation if a participant feels it would be useful to continue to explore 
these topics. If physiological assessments are taken, it may also be useful to provide par- 
ticipants with a summary report of their biological readings over the duration of the 
study. For example, because ambulatory blood pressure is prognostic for future disease, 
participants might be instructed to give the readings to their doctor for interpretation or 
as a “healthy baseline.” 

Reactivity effects should also be considered. These methods are unique in that they 
require people actively to attend to and verbalize their experiences and behaviors repeat- 
edly over time. This raises the question about whether intensive self-reporting changes 
the very phenomenon being measured (reactivity). Studies from the medical literature 
show small or modest reactivity effects, at least for self-reported pain (as in Stone et al., 
2003); however, there are circumstances when reactivity is more likely to occur (e.g., 
when participants are motivated to change, when monitoring only one experience or 
behavior, when testing certain individuals; Conner & Reid, 2012). These circumstances 
and other issues of reactivity are discussed in greater detail by Barta, Tennen, and Litt 
(Chapter 6, this volume). 


Concluding Thoughts 


Part of the appeal of intensive daily life methods is the ability to use technology to capture 
humans’ lived experience. A side effect of this approach is that in a short amount of time, 
many of these technologies become outdated. Therefore, it seems prudent to conclude 
that an important part of successful research is being aware of technological advances 
and thinking creatively about how best to use these new technologies to answer a particu- 
lar theoretical question in a particular population. The growth and spread of wired and 
wireless communities and expanding access to global positioning system (GPS) tracking 
will undoubtedly allow for the immediate upload of participant responses through wire- 
less communication. As researchers, we need to stay connected to these advances and be 
innovative enough to apply them. 


CHAPTER 6 


Measurement Reactivity 
in Diary Research 


WILLIAM D. BARTA 
HOWARD TENNEN 
MARK D. LITT 


The field of diary methods has paid remarkably little attention to measurement 
reactivity. Although the literature includes a number of studies that have examined reactiv- 
ity, and we review some of these studies later in this chapter, the typical diary study either 
has been silent regarding potential measurement reactivity or has offered nothing more 
than an obligatory tip of the hat to this phenomenon in the study’s discussion section, 
followed by citations intended to put the matter to rest. Although we conclude that the 
diary literature has found scant evidence of measurement reactivity, we also assert that 
investigators have not looked very hard, and that when they have looked, their observa- 
tions have for the most part not been guided by the broader literature on measurement 
reactivity or the extensive self-monitoring literature that has emerged over the past 40 
years in the field of behavior therapy. 

In this chapter we offer an overview of the broader measurement reactivity and 
self-monitoring literatures, and their implications for diary studies. We then critically 
examine diary studies that have evaluated reactivity. We conclude this chapter with rec- 
ommendations for how diary investigators might pay more careful attention to measure- 
ment reactivity, including recommendations for how reactivity issues should be reported. 
Our goal is not to challenge current opinion that reactivity plays a modest role in diary 
studies, but to encourage investigators to be less reticent about making reactivity analyses 
an integral aspect of their research reports. 


Measurement Reactivity 


Measurement reactivity refers to the systematically biasing effects of instrumentation 
and procedures on the validity of one’s data. In a seminal and still influential discussion 
of measurement reactivity, Webb, Campbell, Schwartz, and Sechrest (1966) argued convincingly
that almost any measurement method is likely to generate reactivity. Therefore,
in discussing measurement reactivity in relation to diary methods, the key question is 
not “Are diary methods particularly reactive?” The more instructive question is whether 
diary methods are useful in overcoming specific sources of measurement reactivity that 
characterize conventional social and behavioral research methods, and whether we can 
identify and mitigate the reactivity effects to which diary methods are particularly sus- 
ceptible. These questions received attention during the first wave of increased interest in 
diary research during the 1970s and early 1980s. Despite a second wave of interest in 
diary methods that has been growing steadily since the mid-1990s, the evidence remains 
sparse. 


Sources of Measurement Reactivity 


From a historical standpoint, diary methods emerged during a period of heightened 
attention to the ecological validity of behavioral research findings. During the 1970s, 
investigators began to question whether findings from laboratory studies generalized to 
real-world behavior. Also, the effects of completing self-administered self-report instru- 
ments in the presence of other participants or in the presence of a researcher came under 
increased scrutiny. Coinciding with new developments in ethology and ethnography, 
diary methods emerged as an attempt to strike a balance between empirical rigor and the 
desire to consider the effects of social-contextual cues on behavior. 


THE GUINEA PIG EFFECT 


Being observed or evaluated may impair performance in the presence of others if the 
task is difficult, and may enhance performance if the task is simple or has been mas- 
tered (Blascovich, Mendes, Salomon, & Hunter, 1999). These sources of reactivity are 
more pronounced when the presence of an observer is salient. Putting aside social desir- 
ability effects (discussed below), “guinea pig” effects refer specifically to the increased 
self-consciousness and evaluation anxiety that may arise in research contexts. A self- 
administered questionnaire is less reactive than a face-to-face interview, particularly with 
respect to sensitive questions (Durant & Carey, 2000), and a self-administered question- 
naire administered in a group setting is more reactive than one that is administered in a 
private setting (Bates & Cox, 2008). Along these lines, the perceived anonymity of com- 
puterized data collection methods (DiLillo, DeGue, Kras, Di Loreto-Colgan, & Nash, 
2006) offers an advantage to computerized diary studies. 

If a participant anticipates feedback or other consequences of making particular 
responses, there is a potential for bias. Mills, Murray, Johnston, and Donnelly (2008) 
noted that although people are willing to share physical health concerns with their health 
care providers, many are reluctant to share concerns about their emotional and interper- 
sonal adjustment. Similarly, research participants who anticipate discussing their diary 
entries with a researcher may underreport sensitive behaviors. 


SOCIAL DESIRABILITY 


Social desirability effects may be divided into two broad categories. Impression management
refers to conscious attempts to cast oneself in a favorable light, whereas deceptive
self-enhancement refers to the individual’s often inadvertent process of providing
responses that are consistent with a favorable self-image (Paulhus, 1991).

In an intermethod comparison of face-to-face interviews, self-administered self- 
report questionnaires, and paper diaries, McAuliffe, DiFranceisco, and Reed (2007) 
found that, compared to other methods, diaries yielded more self-reported instances of 
sex acts without a condom and anal sex acts. Catania, Gibson, Chitwood, and Coates 
(1990) have noted that if the same respondent is interviewed on successive occasions, 
habituation to being asked sensitive questions may increase the likelihood of reporting 
socially undesirable behaviors. Once a person admits to having engaged in a sensitive
behavior and does not experience adverse consequences, it is likely that he or she will
continue to report these behaviors.

Of the two forms of social desirability bias, impression management is more deliber- 
ate and likely to be affected by social presence. Deceptive self-enhancement, however, the 
more pervasive form of social desirability bias, is a serious concern when the participant 
is called upon to report events or experiences from the relatively distant past, because, as 
the memory trace degrades, individuals rely on reconstructive processes (Schwarz, Chap- 
ter 2, this volume) influenced by self-enhancement. If the recall period is a day or less, 
as it is in many diary studies, remembered events and experiences may be less strongly 
affected by self-deceptive enhancement than events and experiences occurring several 
days or weeks ago. 


SATISFICING 


Satisficing refers to limits that a respondent imposes on the amount of effort she or he is 
willing to apply to answer a question or set of questions (Krosnick, 1991). Some methods 
of obtaining diary data may be more apt to evoke satisficing than others. For example, 
when diary reports are collected via phone by live interviewers, respondents may feel 
an implicit time pressure to answer questions quickly rather than leave the interviewer 
waiting (Bowling, 2005). The wider concern is that fatigued, distracted, or hurried par- 
ticipants will respond to questions inattentively (Petty & Cacioppo, 1986). Diary studies 
are likely to provoke satisficing if there is implicit time pressure to respond quickly, or if 
participant burden, in the form of time-consuming diary protocols, is particularly high. 

Research participants who are asked to summarize their own behavior retrospectively 
over a period of time may respond differently depending on whether they employ a “top- 
down” strategy (based on a global impression) or a “bottom-up” recall strategy (based 
on the recollection of specific instances; Critcher & Dunning, 2009). Diary researchers 
mitigate the bias resulting from schema-based processing by restricting retrospection to 
specific and recent events (Reis & Gable, 2000). Yet if a diary assessment includes global 
self-assessments, it is as likely to introduce bias as any other self-report method. 

To mitigate the risk of satisficing, diary researchers have been advised to limit the 
length and difficulty of diary assessments, to use simple language, to limit the number of 
clauses within questions and the number of response options per question, and to avoid 
subjective items in favor of concrete, objective items (cf. Reis & Gable, 2000; Yan & 
Tourangeau, 2008). The participant should also be thoroughly familiar with the content 
of the diary before it is administered. Although these elements of optimal diary design 
are now well known to diary researchers, they have yet to be thoroughly evaluated or

We conclude that participants’ perceived anonymity in a diary study reduces social
desirability effects in comparison to other self-report methods. By minimizing the period 
of retrospection to recent and specific instances, errors of retrospection also are attenu- 
ated. The effects of satisficing are a concern in diary studies but by no means are con- 
fined to diary studies. We now examine whether the demand to engage in repeated self- 
observation through self-monitoring might alter the participant’s relationship to the 
observed behavior. 


Self-Monitoring in Assessment 


Broadly speaking, self-monitoring refers to the mindful observation of one’s own behav- 
ior, with the implication that people sometimes engage in absentminded behavior, and 
may (for example) smoke or drink at a greater volume or frequency than they recognize. 
“Habitual behaviors are automatic,” Kazdin (1980, p. 254) observed. 


Origins in Behavior Analysis 


Self-monitoring emerged as an object of scientific study for behavior analysts and 
cognitive-behavioral therapists during the middle part of the 20th century. In an influen- 
tial work, Beck (1985) described self-monitoring as a systematic process comprising three 
steps: (1) detecting when overt or covert target responses to stimuli occur, (2) systematic 
observation of stimuli and responses, and (3) identification of recurring patterns. 

Given the specific focus on the temporal ordering of stimulus-response sequences, 
researchers sought to maximize the accuracy of assessments. Many questioned the abil- 
ity of participants to recall accurately the frequency, duration, intensity, and contingency 
of specific behaviors over a period of weeks or months. A consensus emerged that more 
accurate records could be obtained if the behavior was recorded at the time and place of 
its occurrence (Nelson & Hayes, 1979). This remains the prevailing view (e.g., Barta & 
Tennen, 2008; Shiffman, 2009). 

Early advocates of the use of self-monitoring in observational studies recognized, 
however, that self-monitoring also had the potential to alter the behavior under observa- 
tion. To investigate this concern, McFall (1970) conducted studies among college smok- 
ers. Students who monitored smoking tended to increase their smoking rates, whereas 
those who were instructed to monitor nonsmoking tended to decrease their smoking rates 
during the self-monitoring period. McFall concluded that “when an individual begins 
paying unusually close attention to one aspect of his [sic] behavior, that behavior is likely 
to change, even though no change may be intended or desired” (p. 140). The reactive 
effect of close attention to a particular behavior has been documented in studies of exces- 
sive alcohol intake (Helzer, Badger, Rose, Mongeon, & Searles, 2002), amphetamine 
abuse (Hay, Hay, & Angle, 1977), overeating (Wing, Carrol, & Jeffery, 1978), and other 
behaviors. 

Other studies, however, have found no evidence that even extensive self-monitoring 
changes behavior. In a study by Litt, Cooney, and Morse (1998), alcoholics were sig- 
naled by a programmed wristwatch to report drinking, urges to drink, mood states, 
and the situations in which they found themselves eight times per day for 21 days following
inpatient alcoholism treatment. Responses were recorded on cards. Alcohol use
data from this in vivo monitoring sample were compared with data from a comparison
group of patients who were treated in the same treatment program over the same time 
period but did not self-monitor. The comparison group and self-monitoring participants 
demonstrated equivalent levels of abstinence, alcohol use, and drinking days during the 
follow-up period. Similarly, in a study by Hufford, Shields, Shiffman, Paty, and Bala- 
banis (2002), college women were asked to complete a 2-week monitoring protocol using 
palmtop computers, during which they were prompted to record up to seven times per 
day. Drinking rates did not change from pre- to postmonitoring. This is consistent with 
a lack of reactivity. 


Conditions Influencing Reactivity of Self-Monitoring 


It is apparent that self-monitoring procedures can sometimes, but not always, yield reac- 
tive effects on the behaviors under investigation. Here we identify those factors that con- 
tribute to reactivity in self-monitoring. 


AWARENESS AND REFLECTION 


In its therapeutic applications, self-monitoring addresses the problem that “people are not 
entirely aware of the extent to which they engage in various behaviors” (Kazdin, 1980, 
p. 254). It is thought that increases in awareness and reflection may account, at least in 
part, for the relationship between self-monitoring and behavior change (Roberts & Stark, 
2008). 


MOTIVATION 


Kanfer (1970) suggested that self-observation enables a person to compare his or her 
observations to a standard of performance, which may increase motivation to change 
one’s behavior. Korotitsch and Nelson-Gray (1999), in a review, noted that the effects of 
self-monitoring on behavior change are strongest among individuals who are motivated 
to change their behavior. 


PERCEIVED DESIRABILITY OF THE BEHAVIOR 


Nelson (1977) evaluated research looking at numerous factors that influenced the accu- 
racy and reactivity of the self-monitoring experience. Chief among these was the valence 
of the behavior to be monitored. Nelson concluded that, in general, desirable behaviors 
increase and undesirable behaviors decrease while being monitored, at least for volitional 
behaviors such as drinking or smoking. Kazdin (1980) theorized that having one’s atten- 
tion drawn to a personally or socially undesirable behavior produces negative affect and 
is negatively reinforcing, just as having one’s attention drawn to desirable behaviors pro- 
duces positive affect and is positively reinforcing. 


INSTRUCTIONS OR DEMAND FOR CHANGE 


The more explicit the demand for behavior change, the greater the effects of self-monitoring 
(Kazdin, 1979). The therapeutic effects of self-monitoring appear to be greater when it is
presented to the participant as a treatment modality (Abrams & Wilson, 1979), or when
the context of the activity implies a therapeutic goal. 


NUMBER OF BEHAVIORS BEING SELF-MONITORED 


Hollon and Kendall (1981) theorized that monitoring a single behavior is more reactive 
than the task of monitoring several behaviors. Consistent with this, a study of depressed 
individuals that monitored mood, activities, and events (Hammen & Glass, 1975) did 
not show signs of reactivity, but a study that monitored a single aspect of behavior (Har- 
mon, Nelson, & Hayes, 1980) did show reactivity. Intensive focus on one behavior may 
increase the salience of the behavior, and stimulate awareness and reflection concerning 
factors giving rise to the behavior. 


SEQUENCE OF MONITORING 


For the most part it appears that recording a behavior before it occurs results in greater 
reactivity than recording a behavior after it has occurred. For example, Bellack, Rozen- 
sky, and Schwartz (1974) found that self-monitoring of food intake information prior to 
eating produced greater weight loss than recording the same type of information after 
eating. Similar findings have been reported for smoking (Rozensky, 1974). Kanfer (1970) 
and Shiffman (2009) have theorized that recording prior to a behavior disrupts the auto- 
maticity of the response chain, creating an opportunity for the individual to opt for a 
more controlled response. 


EXPLICIT FEEDBACK 


Provision of feedback is considered a necessary aspect of a therapeutic approach that 
emphasizes self-control (Kazdin, 1980). Even when feedback is not provided or encour- 
aged, access to previous diary entries may trigger increased awareness and reflection. 
Hence, electronic recording devices that prevent access to previous responses are gener- 
ally favored over paper diaries that do allow access to previous entries. In a review of 
ecological momentary assessment (EMA) studies of drug and alcohol use, which involved 
multiple recordings per day, Shiffman (2009) concluded that EMA procedures, by and 
large, have not demonstrated strong reactivity. All of the studies he reviewed relied on 
electronic devices designed to block participants from reviewing prior entries. 


Summary 


Self-monitoring is employed in both assessment and treatment settings. Although the 
reactive nature of self-monitoring has been assumed, reviews of the literature have con- 
sistently painted a more complex picture. In general, the monitoring of a single behav- 
ior, explicit instructions or expectations for change, and feedback or access to previous 
records all create the conditions for reactive effects. On the other hand, when previous 
records are not accessible, multiple behaviors and cognitions are monitored, and demand 
for change is minimized, reactivity to monitoring is reduced. This is the case for even very 
intensive monitoring, such as with EMA or experience sampling protocols, as well as in 
the monitoring of daily processes.


Toward a More Principled Approach
to Measurement Reactivity in Diary Studies 


The rapidly growing diary literature in social, clinical, health, and personality psychol- 
ogy includes hundreds of studies examining a broad array of behavioral phenomena and 
a far smaller group of studies examining measurement reactivity in the context of diary 
assessment. Most observational studies have paid scant attention to potential reactiv- 
ity. Although some of these studies have assessed trending in the time series, trending is 
often considered a nuisance variable to be controlled rather than a potential indicator 
of measurement reactivity. Most typically, when reactivity is addressed in a published 
report, it appears as an afterthought, with a few citations of studies that have failed to 
find reactivity effects. 

We believe that this obligatory, after-the-fact acknowledgment of the potential for 
measurement reactivity, quickly dismissed by citations of presumably supportive evi- 
dence, prompts three important questions: 


1. Do studies that have explicitly evaluated measurement reactivity in the context 
of diary studies actually converge on the conclusion that reactivity is modest or 
nonexistent? 

2. Are all of these studies comparably well designed, and do differences among the 
studies predict whether reactivity effects emerge? 

3. Even if the majority of previous measurement reactivity studies shows little or no 
evidence of reactivity, can the findings be used by investigators to ignore the issue 
of reactivity going forward? 


Citing Previous Null Findings 


We begin with the third question, because it touches on the conceptualization of mea- 
surement reactivity. As diary investigators, if we implicitly agree that we need only cite a 
few previous studies that have reported no evidence of measurement reactivity to address 
the issue adequately in our own studies, we are agreeing that the absence of meaningful 
measurement reactivity is an inherent feature of diary methods. This strikes us as a risky 
strategy, and one at odds with the standards of other psychological investigation meth- 
ods and the self-monitoring literature we have reviewed. Although a majority of social- 
psychological studies may have reported no evidence of demand characteristics using a 
particular experimental manipulation, this does not absolve investigators from assessing 
demand characteristics when they apply the manipulation to a new sample and examine 
a different outcome variable. Similarly, although most published psychotherapy studies 
have used fidelity analyses to show that their study therapists were faithful to the treat- 
ment protocol, subsequent psychotherapy studies must include their own fidelity analyses 
because protocol fidelity is not an inherent quality of manual-based psychotherapy. And 
although all published studies using a particular personality scale may have reported the 
scale’s internal consistency, authors who simply report that the scale is reliable because 
reliability has been demonstrated in previous studies will be reminded by an action editor 
that internal consistency is sample-dependent. 

Many investigators using diary methods have referred to previous studies, as if the 
absence of measurement reactivity is an inherent quality of diary methods. Yet the self- 
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monitoring literature (discussed earlier) indicates that reactivity likely depends on factors 
that vary from study to study, including (1) valence of the behavior, (2) participants’ moti- 
vation for change, (3) characteristics of the monitored behavior, (4) the number of behav- 
iors and experiences monitored simultaneously, (5) whether attitudes toward the target 
behavior are assessed concurrently with the behaviors, (6) the sequence of monitoring, 
(7) whether feedback to participants is part of the diary protocol, and (8) whether partici- 
pants have access to their previous diary entries. Therefore, we question the legitimacy of 
citing previous studies’ null findings to support the contention that a new study is free of 
measurement reactivity without at least documenting how the new study evaluated each 
of these likely predictors of reactivity. 
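
One way to make such documentation routine is to treat the eight predictors listed above as a reporting checklist. The following is only a minimal sketch: the class name, the field names, and the free-text format of each entry are hypothetical conveniences of ours and are not drawn from any published reporting standard.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReactivityChecklist:
    """Hypothetical record of how a diary study addressed each predictor.

    Each field holds a short free-text description, or None when the
    study report does not document that predictor.
    """
    behavior_valence: Optional[str] = None
    motivation_for_change: Optional[str] = None
    behavior_characteristics: Optional[str] = None
    number_monitored: Optional[str] = None
    attitudes_assessed_concurrently: Optional[str] = None
    monitoring_sequence: Optional[str] = None
    feedback_provided: Optional[str] = None
    access_to_previous_entries: Optional[str] = None

    def undocumented(self) -> List[str]:
        """Return the names of predictors the report leaves unaddressed."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Illustrative (invented) entries for a single study report.
report = ReactivityChecklist(
    behavior_valence="negative (binge eating)",
    number_monitored="single behavior",
)
print(report.undocumented())
```

A report that leaves several entries empty has not shown that reactivity was absent; it has only shown that reactivity was not examined.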


What Does the Literature Actually Tell Us? 


We now turn to the literature to address the first question we raised earlier, which is 
whether studies that have explicitly evaluated measurement reactivity in the context of 
diary studies converge on the conclusion that reactivity is modest or nonexistent. The 
following is only a sampling of the literature, not a comprehensive review. Quite a few 
of these studies have considered measurement reactivity as a secondary or tertiary aim 
of the study; others have examined reactivity in the context of assessing the feasibility of 
diary methods with a particular study sample, and for still other studies, the term “reac- 
tivity” does not appear in the title or keywords assigned to the article. 

Studies that have explicitly examined reactivity span an array of behaviors, including 
disordered eating (e.g., Latner & Wilson, 2002), problem drinking (e.g., Collins et al., 
1998; Simpson, Kivlahan, Bush, & McFall, 2005), smoking (Rowan et al., 2007), chronic 
pain (e.g., Aaron, Turner, Mancl, Brister, & Sawchuk, 2005; Peters et al., 2000), mari- 
tal conflict (Merrilees, Goeke-Morey, & Cummings, 2008), and sexually risky behavior 
(Gilmore et al., 2001). Some of these studies have found rather clear evidence of diary 
measurement reactivity, others have found modest or mixed evidence, and still others 
have found no evidence of reactivity. 


Studies Documenting Reasonable Evidence 
of Diary Measurement Reactivity 


A few diary studies have yielded reasonable evidence of diary reactivity across rather 
distinct psychological phenomena. Latner and Wilson (2002) had women diagnosed 
with bulimia nervosa or binge-eating disorder who were not in treatment keep continu- 
ous records of their food intake. Compared with the frequency of binge eating assessed 
during initial structured interviews, participants showed a substantial decrease in binge 
eating during the daily reporting period. A significant proportion of participants did 
not meet formal binge-eating frequency criteria during the daily reporting period, and 
among those with binge-eating disorder, more than a quarter did not binge-eat during 
the diary period. Of course, one may reasonably assert that many of these women were 
misdiagnosed because their diagnoses were not based on real-time assessments of food 
intake. Also, participants in this study may have been highly motivated to change their 
eating patterns: The valence of the behavior is negative, the behavior being monitored is 
just the type of behavior thought to change as a result of monitoring, and the investigators
monitored a single behavior. These are precisely the conditions under which diary
recording should lead to behavior change (Fremouw & Brown, 1980). But before we
dismiss this evidence of diary reactivity we need to acknowledge that the diary methods 
literature has not applied the same critical eye to the many observational diary studies 
that have either ignored such behavior changes or interpreted them in terms of a favored 
theoretical construct. 

A study of pain and fatigue among rheumatology patients (Broderick & Vikingstad, 
2008) also found evidence of reactivity. Participants completed a standard depression 
questionnaire before and after completing approximately six momentary assessments a 
day for 30 days. Depression scores showed a significant pre- to posttreatment decrease. 
The authors reasoned that while frequently monitoring their pain and fatigue, study par- 
ticipants may have become aware that these unpleasant experiences showed variabil- 
ity, which resulted in fewer depressive symptoms. Some readers might be quick to point 
out that this study did not have a control group, and that without a control group the 
observed changes cannot be interpreted unambiguously. Although we agree, this obser- 
vation prompts the question of why there has been so little concern about control groups 
in the broader diary literature, an issue to which we return later in this chapter. 

Helzer and colleagues (2008) assigned heavy drinkers who had received a brief 
(several-minute) intervention focused on their drinking patterns to one of four condi- 
tions for 6 months: no daily interactive voice response (IVR) reporting, daily IVR, daily 
IVR with monthly feedback of daily alcohol use, and daily IVR with monthly feedback 
and monetary incentives for completing daily IVR. Although monthly feedback showed 
a therapeutic effect, as the self-monitoring literature would predict, we focus here on the 
finding that participants who completed daily diaries without feedback reported more 
drinking at the 3- and 6-month follow-up assessments than did participants who did 
not complete daily diaries. This study, which included a control group, is yet another to 
reveal reasonable evidence of diary measurement reactivity. 


Studies Documenting Modest Evidence of Diary Measurement Reactivity 


The majority of diary reactivity studies report modest evidence of reactivity. A number
of these studies have focused on alcohol use among heavy drinkers and smoking among
smokers motivated to quit. In a diary study of problem drinkers, Hufford and colleagues 
(2002) found no evidence of reactivity in alcohol use. However, evidence of measurement 
reactivity emerged with regard to participants’ motivation to change their alcohol use. 
Collins and colleagues (1998), on the other hand, found that heavy drinkers recruited 
from the community decreased their drinking during real-time monitoring. In a study of 
smokers motivated to quit, Rowan and colleagues (2007) randomized participants to one 
of three groups: daily diaries for the week preceding a planned quit date, daily diaries for 
the week following the quit date, or no daily diaries. An inconsistent pattern across time 
provided some evidence of diary reactivity. 

Diary reactivity studies of behavioral phenomena other than drinking and smoking 
have also reported modest evidence of reactivity. In a study of burnout, for example, Son- 
nenschein, Sorbi, van Doornen, and Maas (2006) compared burnout symptoms among 
burned out participants and healthy controls who reported their symptoms five times 
a day for 2 weeks. Linear trends revealed modest evidence of reactivity. In the area of 
marital interaction, Merrilees and colleagues (2008) had married couples complete event-contingent
diaries about their disagreements for 15 days. Although the diary group and a
nonrandomized control group could not be distinguished on expressed emotions during
marital interactions in conflict resolution tasks, husbands’ diary reports revealed decreas- 
ing perceptions of marital quality, suggesting possible reactivity. Overall, studies across a 
range of behavioral phenomena have found some evidence of reactivity. 


Studies Reporting No Evidence of Diary Measurement Reactivity 


Among studies that have found no evidence of diary measurement reactivity, the largest 
group has examined chronic pain. In the most carefully designed study in this area, Stone 
and colleagues (2003; also see Cruise, Broderick, Porter, & Kaell, 1996) assigned indi- 
viduals with pain syndromes to EMA diaries 3, 6, or 12 times daily, or to no diary expo- 
sure. These investigators found no evidence of reactivity, including no trending of daily 
pain reports, and no evidence linking the number of daily reports to reactivity. In another 
diary reactivity study involving individuals with chronic temporomandibular disorder, 
Aaron and colleagues (2005) found that although 73% of study participants believed that 
the thrice-daily reporting for 2 weeks affected their pain, their own pain ratings did not 
support their retrospective impressions, and these investigators concluded that there was 
no evidence of reactivity on pain-related measures. Peters and colleagues (2000) similarly 
found no evidence of measurement reactivity in pain reports among individuals with 
unexplained pain who rated their pain intensity four times a day over 4 weeks. 

A lack of evidence of diary reactivity has also been reported in studies of alcohol 
use. In findings echoing those of Aaron and colleagues (2005) for chronic pain, Litt and 
colleagues (1998) found that although alcoholic participants reported that their drinking 
decreased following their participation in this diary study, the changes reported by these 
individuals did not differ from those of a control group that did not participate in the 
diary task. Similarly, Simpson and colleagues (2005) randomly assigned individuals in 
early recovery from alcohol dependence to 4 weeks of daily IVR diaries or to completion 
of a 28-day follow-up assessment only, based on retrospective reports. These investiga- 
tors found no group differences in drinking or alcohol craving. 


What Is Missing from the Diary Reactivity Measurement Literature? 


The intent of our brief review is not to provide a comprehensive overview and synthesis 
of the diary reactivity literature. This literature is not sufficiently advanced for such a 
detailed analysis. Rather, our aim is to provide a sampling of the evidence and to convey 
that this literature has yielded mixed findings, with most studies reporting modest evi- 
dence of diary-related measurement reactivity. Most striking, however, are the method- 
ological variability across studies and the limitations of many studies in this literature. 
These mixed findings, and the study limitations we now discuss, lead us seriously to 
question the rather commonplace tactic in observational diary studies of mentioning 
potential reactivity in passing, then citing a few of the negative studies in this literature, 
as if to put the matter to rest. More to the point, diary investigators have not made 
the examination of measurement reactivity as high a priority as scale reliability when 
reporting the characteristics of their questionnaire measures. The current status of the 
diary reactivity literature demands a more comprehensive assessment of measurement 
reactivity, based on the broader measurement reactivity and self-monitoring literatures 
we have reviewed. 


118 STUDY DESIGN CONSIDERATIONS AND METHODS OF DATA COLLECTION 


CONTROL GROUPS 


Very few observational diary studies and relatively few diary reactivity studies have 
included participants who did not use diaries. Smith (1999), Rowan and colleagues 
(2007), and others have underscored the importance of including one or more control 
groups in diary studies. We, like most diary investigators, are aware that not every diary 
study needs a control group to make a contribution. Yet this awareness has led to 
missed opportunities to include control groups when doing so has demonstrable benefits. 
One may argue that the absence of univariate trending in a diary study’s time series is 
sufficient evidence that reactivity was not observed, and that pre- to poststudy changes 
on questionnaires are less relevant to measurement reactivity, because the recall error and 
bias associated with almost all such questionnaires is precisely why investigators have 
turned to diary methods. We believe that the simultaneous examination of trending and 
the evaluation of pre- to poststudy changes in more investigations would go a long way 
toward advancing our understanding of measurement reactivity in diary studies. 


ADEQUATELY POWERING DIARY REACTIVITY STUDIES 


As noted, many studies that examined diary reactivity have addressed the issue only as 
an ancillary aim. Interpreting null findings in such studies as evidence of the absence of 
diary reactivity, and not as Type II error, favors the conclusion that reactivity should be 
of little concern to diary investigators. We urge investigators who plan to study diary 
reactivity to design their studies so that reactivity is the primary study variable, and to 
calculate sample size based on this primary variable. 
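
To illustrate why reactivity examined only as an ancillary aim is often underpowered, the following Python sketch approximates the number of participants per group needed to detect a standardized mean difference between a diary group and a no-diary control group. It is a minimal sketch under stated assumptions: the function name n_per_group, the effect sizes, and the simple two-group normal approximation are ours for illustration and are not taken from any study reviewed here. The calculation ignores the repeated-measures structure of diary data, so it is a first planning aid rather than a substitute for a design-specific power analysis.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided comparison of
    two independent group means (normal approximation).

    effect_size_d is the assumed standardized mean difference between
    the diary group and the control group (a hypothetical input).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size_d) ** 2)


# Hypothetical effect sizes: small, small-to-medium, medium.
for d in (0.2, 0.3, 0.5):
    print(f"d = {d}: about {n_per_group(d)} participants per group")
```

Under these assumptions, a small reactivity effect (d = 0.2) calls for several hundred participants per group, which is far more than most diary studies enroll.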


CHANGES IN WITHIN-PERSON SLOPES AS AN UNEXAMINED REACTIVITY INDICATOR 


When a theory-derived hypothesis is not confirmed in a diary study, astute investigators 
(e.g., Marco, Schwartz, Neale, Shiffman, & Stone, 1999) consider the possibility that the 
process of diary recording might have increased participants’ awareness of their behav- 
ior and changed relations in within-person slopes. Yet this type of explanation for null 
findings in within-person slopes has not been applied to examining potential changes in 
these slopes in diary reactivity studies. For example, participants in alcohol use studies 
who regularly use alcohol to alleviate negative mood might, through providing repeated 
simultaneous reports of their affective state and alcohol use, come to realize that they 
drink to cope. Although they may not change their overall alcohol consumption, and 
would thus not show downward trending in alcohol consumption, the within-person 
association between negative mood and alcohol use might become attenuated during the 
diary recording period. Similarly, participants in a study of coping with chronic pain may 
notice after the first week of diary entries that on high-pain days they are inclined to cope 
with their pain by withdrawing from potential sources of support. This might lead to a 
change over the recording period in the within-person association between pain intensity 
and support seeking, though aggregated levels of support seeking may show no change. 
A similar dynamic may unfold for affect intensity. Silvia (2002) has noted that tasks 
that cue increased awareness of one’s current affective state may reduce affect inten- 
sity. Although diary methods examining daily affect are likely to increase awareness of 
affective states, and these methods are commonly used to measure affect intensity (e.g., 
Gunthert et al., 2007), we are unaware of a diary study that has examined diary-induced 
changes in affect intensity (see Tennen & Affleck, 1996). Again, the rather modest evi- 
dence of diary measurement reactivity should not be particularly reassuring to investiga- 
tors to the extent that this evidence, by and large, is based on halfhearted efforts to detect 
reactivity. 
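
The drinking-to-cope example above can be made concrete with a small sketch. The Python code below compares one participant's within-person slope of alcohol use on negative mood in the first and second halves of the recording period. All data values and function names (within_person_slope, slope_change) are hypothetical and chosen only to show that a slope can attenuate while mean consumption stays the same. A full analysis would more plausibly use a multilevel model with a mood-by-time interaction; this split-half comparison is a simplification.

```python
from statistics import mean


def within_person_slope(x, y):
    """Ordinary least-squares slope of y on x for one participant."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx


def slope_change(mood, drinks):
    """Compare the mood-drinking slope in the first and second half
    of one participant's recording period."""
    half = len(mood) // 2
    early = within_person_slope(mood[:half], drinks[:half])
    late = within_person_slope(mood[half:], drinks[half:])
    return early, late, late - early


# Hypothetical participant: 8 diary days (made-up values).
negative_mood = [1, 2, 3, 4, 1, 2, 3, 4]
drinks = [1, 2, 3, 4, 2, 3, 2, 3]

early, late, change = slope_change(negative_mood, drinks)
print("mean drinks, first half :", mean(drinks[:4]))
print("mean drinks, second half:", mean(drinks[4:]))
print("slope early, late, change:", early, late, change)
```

In this made-up series the mean number of drinks is identical in both halves, so a check for univariate trending would show nothing, yet the mood-drinking slope is clearly weaker in the second half.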

It is also important to note that reactive effects of self-monitoring are not necessarily 
linear. As discussed in the literature on “sudden gains” in psychotherapy, participants 
may exhibit pronounced improvements in mood measures between counseling visits; in 
some cases, these gains are lasting and in others they are not (Tang, DeRubeis, Hollon, 
Amsterdam, & Shelton, 2007). Nonlinear reactive effects may go unnoticed unless one 
is alert to the possibility. 


ASSESSMENT OF POSSIBLE RESPONSE SHIFT 


The response shift phenomenon (Schwartz, Sprangers, Carey, & Reed, 2004) is well 
known to quality-of-life researchers, who have found that in repeated measures study 
designs, participants may change the meaning they assign to a scale rating. For example, 
a study participant who rated her depressed mood as 5 on a 5-point scale might on a 
subsequent report rate the same intensity of depressed mood as 3 or 4 because she has 
come to realize that her depressed mood could be considerably more severe than the 
mood to which she had previously assigned a rating of 5. Similarly, participants in diary 
studies have numerous opportunities to recalibrate the meaning of their ratings based on 
repeated requests to record their experience and behavior. Yet, to our knowledge, studies 
of diary measurement reactivity have not evaluated evidence of response shifts. 


SAMPLING BIAS 


It is not uncommon for observational studies of alcohol use to have a disproportionate 
number of missing reports on weekend days (when more drinking occurs), and we sus- 
pect that missing data in diary studies of individuals with chronic pain occur dispropor- 
tionately on high-pain days. To the extent that the undersampling of weekends or high- 
pain days may provide a distorted picture of the phenomenon under investigation, these 
examples illustrate a form of measurement reactivity. 
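
One simple way to check for this kind of undersampling is to compare the proportion of missing reports on weekend days with the proportion on weekdays. The sketch below does this with a two-proportion z-test. The counts and the function name two_proportion_z are hypothetical, and the test treats every scheduled report as independent, which diary data nested within persons do not satisfy. It is therefore a rough screening step, not a formal test of whether data are missing completely at random.

```python
import math
from statistics import NormalDist


def two_proportion_z(miss_a, n_a, miss_b, n_b):
    """Two-sided z-test for a difference between two missingness rates.

    miss_a / n_a: missing and scheduled reports in condition A
    miss_b / n_b: missing and scheduled reports in condition B
    """
    p_a, p_b = miss_a / n_a, miss_b / n_b
    pooled = (miss_a + miss_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a, p_b, z, p_value


# Hypothetical counts: 60 of 200 weekend reports missing,
# 50 of 500 weekday reports missing.
p_weekend, p_weekday, z, p_value = two_proportion_z(60, 200, 50, 500)
print(f"missing on weekends: {p_weekend:.2f}")
print(f"missing on weekdays: {p_weekday:.2f}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```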

The potential for sampling bias should guide decisions about the selection of sam- 
pled events; that is, diary studies generally sample events (or significant moments in time) 
nested within persons. The researcher faces decisions about how to draw an unbiased 
sample from the larger universe of events known as “daily life.” Shiffman (2009) observed 
that “episode-level” analyses used by researchers in the addictions field often focus on 
cues that immediately precede a “lapse.” For example, one may wish to understand the 
cues that are present shortly before someone who is attempting to quit smoking has a cig- 
arette. These cues can range from sensory stimuli (e.g., the aroma of cigarette smoke) to 
internal cues (e.g., negative affective states or the sensation of craving). Yet, as Shiffman 
noted, some researchers have “pulled back the camera” to examine broader time frames, 
and have discovered that higher-than-average self-reported negative affect often begins 
to build several hours earlier in the same day. Predictions of smoking incidents based on 
a single self-report of craving may be less precise than predictions based on an observed 
within-day trajectory of increasing craving. 

This is an important consideration when evaluating “critical incident” studies in 
which participants are asked to log a diary entry each time they engaged in a particular 
behavior (e.g., an occasion of alcohol use), and identify cues proximal to the behavior (see 
Weinhardt & Carey, 2000, for a review). One threat to the validity of critical incident 
studies is the lack of a comparison sample of “nonevents” in which craving or negative 
affective states existed but did not result in instances of problem behavior. 


NOT ALL OF THE PEOPLE ALL OF THE TIME: MODERATED REACTIVITY 


Diary method researchers recognize that many of their most interesting findings involve 
cross-level interactions, and are careful to design their studies to include conceptually 
relevant moderator variables. Yet diary investigators have in large part ignored mod- 
erators that bear on diary measurement reactivity. In a noteworthy exception to this 
trend, Conner and Reid (2012) recently examined reactivity moderators in a 2-week short 
message service (SMS) text-messaging study in which young adult participants reported 
their momentary happiness one, three, or six times daily. Conner and Reid observed that 
participants with more depressive symptoms and those with higher neuroticism scores 
showed declines in momentary happiness over time when the protocol required a more 
intensive focus on happiness, whereas participants with fewer depressive symptoms and 
lower neuroticism scores showed increases in momentary happiness over time. Had these 
investigators not examined depressive symptoms and neuroticism as effect modifiers, 
they would have concluded, as have authors of previous studies, that there was no evi- 
dence of measurement reactivity. Once again, evidence of diary measurement reactivity 
will not emerge unless we search for it using the same methodological standards we apply 
to address substantive research questions. 
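
The logic of moderated reactivity can be sketched in a few lines. In the Python example below, each participant's linear time trend in daily happiness is estimated separately, and the trends are then averaged overall and within moderator groups. The data, the grouping into high and low depressive symptoms, and the function name time_trend are hypothetical and are not the data or the analysis of Conner and Reid (2012). The sketch only shows how opposite trends in two subgroups can cancel in the full sample.

```python
from statistics import mean


def time_trend(ratings):
    """Least-squares slope of ratings on day index (0, 1, 2, ...)."""
    days = list(range(len(ratings)))
    md, mr = mean(days), mean(ratings)
    sxx = sum((d - md) ** 2 for d in days)
    sxy = sum((d - md) * (r - mr) for d, r in zip(days, ratings))
    return sxy / sxx


# Hypothetical data: (high depressive symptoms?, daily happiness ratings)
participants = {
    "p1": (True, [5, 4, 4, 3, 3]),
    "p2": (False, [3, 3, 4, 4, 5]),
    "p3": (True, [4, 4, 3, 3, 2]),
    "p4": (False, [2, 3, 3, 4, 4]),
}

trends = {pid: time_trend(r) for pid, (_, r) in participants.items()}
overall = mean(list(trends.values()))
high = mean([t for pid, t in trends.items() if participants[pid][0]])
low = mean([t for pid, t in trends.items() if not participants[pid][0]])

print("mean trend, all participants      :", round(overall, 3))
print("mean trend, high depressive group :", round(high, 3))
print("mean trend, low depressive group  :", round(low, 3))
```

Here the average trend across all participants is essentially zero, which would read as no reactivity, while the two subgroups change in opposite directions.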


A Modest Proposal for Evaluating and Reporting Measurement Reactivity in Published Reports 


Our review of the diary reactivity literature leads us to several recommendations for how 
measurement reactivity should be evaluated and reported. Studies intended to evaluate 
diary reactivity should be explicitly designed to do so by (1) including an appropriate con- 
trol group or groups; (2) powering the study so that there is a reasonable chance of detect- 
ing reactivity; (3) examining and reporting linear trends in univariate diary variables, 
temporal shifts in questionnaire-based indicators, and changes in bivariate within-person 
slopes; (4) providing a rationale for selecting a measurement time scale; (5) reporting 
whether diary data are missing completely at random; (6) seeking evidence of possible 
response shifts; and (7) evaluating the effects of theoretically meaningful moderators on 
reactivity. 

Investigators reporting the results of observational diary studies should address reac- 
tivity long before their study’s discussion section, and with far more attention than reac- 
tivity now receives in the observational diary literature. Using the self-monitoring litera- 
ture as a guide, investigators should address (1) the valence of the behavior or experience 
measured, (2) study participants’ motivation for change and attitudes toward the target 
behavior, (3) the nature of the target behavior or experience, (4) the number of behaviors 
monitored simultaneously, (5) the sequence of monitoring, and (6) whether participants 


Measurement Reactivity in Diary Research 121 


might have had access to their previous diary entries. At the more empirical level, inves- 
tigators should report linear trends in study variables (rather than simply controlling for 
such trends), inform readers whether diary data were missing completely at random, and 
evaluate and report whether theoretically meaningful moderators change the interpreta- 
tion of what might otherwise all too easily be viewed as no evidence of reactivity. We are 
confident that if investigators follow these empirically derived recommendations, then 
subsequent reviews of the diary reactivity literature will reveal a far more nuanced, chal- 
lenging, and interesting picture than the one portrayed in the current diary literature. 


References 


Aaron, L., Turner, J., Mancl, L., Brister, H., & Sawchuk, C. (2005). Electronic diary assessment of 
pain-related variables: Is reactivity a problem? Journal of Pain, 6, 107-115. 

Abrams, D. B., & Wilson, G. T. (1979). Self-monitoring in the modification of cigarette smoking. Jour- 
nal of Consulting and Clinical Psychology, 47, 243-251. 

Barta, W., & Tennen, H. (2008). The idiographic-nomothetic paradigm in the study of health behav- 
ior: Theory and practice. In B. P. Reimann (Ed.), Focus on personality and social psychological 
research (pp. 13-60). Hauppauge, NY: Nova Science. 

Bates, S. C., & Cox, J. M. (2008). The impact of computer versus paper-pencil survey, and individual 
versus group administration, on self-reports of sensitive behaviors. Computers in Human Behav- 
ior, 24, 903-916. 

Beck, S. (1985). Self-monitoring. In A. S. Bellack & M. Hersen (Eds.), Dictionary of behavior therapy 
techniques (pp. 197-199). New York: Pergamon. 

Bellack, A. S., Rozensky, R., & Schwartz, J. (1974). A comparison of two forms of self-monitoring in a 
behavioral weight reduction program. Behavior Therapy, 5, 523-530. 

Blascovich, J., Mendes, W. B., Salomon, K., & Hunter, S. B. (1999). Social “facilitation” as challenge 
and threat. Journal of Personality and Social Psychology, 77(1), 68-77. 

Bowling, A. (2005). Mode of questionnaire administration can have serious effects on data quality. 
Journal of Public Health, 27(3), 281-291. 

Broderick, J. E., & Vikingstad, G. (2008). Frequent assessment of negative symptoms does not induce 
depressed mood. Journal of Clinical Psychology in Medical Settings, 15, 296-300. 

Catania, J. A., Gibson, D. R., Chitwood, D. D., & Coates, T. J. (1990). Methodological problems in 
AIDS behavioral research: Influences on measurement error and participation bias in studies of 
sexual behavior. Psychological Bulletin, 108(3), 339-362. 

Collins, R. L., Morsheimer, E. T., Shiffman, S., Paty, J. A., Gnys, M., & Papandonatos, G. D. (1998). 
Ecological momentary assessment in a behavioral drinking moderation training program. Experi- 
mental and Clinical Psychopharmacology, 6, 306-315. 

Conner, T. S., & Reid, K. A. (2012). Effects of intensive mobile happiness reporting in daily life. Social 
Psychological and Personality Science, 3, 315-323. 

Critcher, C. R., & Dunning, D. (2009). How chronic self-views influence (and mislead) self-assessments 
of task performance: Self-views shape bottom-up experiences with the task. Journal of Personality 
and Social Psychology, 97(6), 931-945. 

Cruise, C. E., Broderick, J. E., Porter, L., & Kaell, A. (1996). Reactive effects of diary self-assessment 
in chronic pain patients. Pain, 67, 253-258. 

DiLillo, D., DeGue, S., Kras, A., Di Loreto-Colgan, A. R., & Nash, C. (2006). Participant responses 
to retrospective surveys of child maltreatment: Does mode of assessment matter? Violence and 
Victims, 21, 410-424. 

Durant, L. E., & Carey, M. P. (2000). Self-administered questionnaires versus face-to-face interviews in 
assessing sexual behavior in young women. Archives of Sexual Behavior, 29(4), 309-322. 

Fremouw, W. J., & Brown, J. P., Jr. (1980). The reactivity of addictive behaviors to self-monitoring: A 
functional analysis. Addictive Behaviors, 5, 209-217. 

Gilmore, M. R., Gaylord, J., Hartway, J., Hoppe, M. J., Morrison, D. M., Leigh, B. C., et al. (2001). 
Daily data collection of sexual and other health-related behaviors. Journal of Sex Research, 38, 
35-42. 




Gunthert, K. C., Conner, T. S., Armeli, S., Tennen, H., Covault, J., & Kranzler, H. R. (2007). The sero- 
tonin transporter gene polymorphism (5-HTTLPR) and anxiety reactivity in daily life: A daily 
process approach to gene-environment interaction. Psychosomatic Medicine, 69, 762-768. 

Hammen, C. L., & Glass, D. R. (1975). Depression, activity, and evaluation of reinforcement. Journal 
of Abnormal Psychology, 84, 718-721. 

Harmon, T. M., Nelson, R. O., & Hayes, S. C. (1980). Self-monitoring of mood versus activity by 
depressed clients. Journal of Consulting and Clinical Psychology, 48, 30-38. 

Hay, L. R., Hay, W. M., & Angle, H. V. (1977). The reactivity of self-recording: A case report of a drug 
abuser. Behavior Therapy, 8, 1004-1007. 

Helzer, J. E., Badger, G. J., Rose, G. L., Mongeon, J. A., & Searles, J. S. (2002). Decline in alcohol con- 
sumption during two years of daily reporting. Journal of Studies on Alcohol, 63, 551-558. 

Helzer, J. E., Rose, G. L., Badger, G. J., Searles, J. S., Thomas, C. S., Lindberg, S. A., et al. (2008). Using 
interactive voice response to enhance brief alcohol intervention in primary care settings. Journal 
of Studies on Alcohol and Drugs, 69, 251-258. 

Hollon, S. D., & Kendall, P. C. (1981). In vivo assessment techniques for cognitive-behavioral pro- 
cesses. In P. C. Kendall & S. D. Hollon (Eds.), Assessment strategies for cognitive-behavioral 
interventions (pp. 319-362). New York: Academic Press. 

Hufford, M. R., Shields, A. L., Shiffman, S., Paty, J., & Balabanis, M. (2002). Reactivity to ecological 
momentary assessment: An example using undergraduate problem drinkers. Psychology of Addic- 
tive Behaviors, 16, 205-211. 

Kanfer, F. H. (1970). Self-monitoring: Methodological limitations and clinical applications. Journal of 
Consulting and Clinical Psychology, 35, 148-152. 

Kazdin, A. E. (1979). Unobtrusive measures in behavioral assessment. Journal of Applied Behavior 
Analysis, 12, 713-724. 

Kazdin, A. E. (1980). Behavior modification in applied settings (rev. ed.). Homewood, IL: Dorsey 
Press. 

Korotitsch, W. J., & Nelson-Gray, R. O. (1999). An overview of self-monitoring research in assessment 
and treatment. Psychological Assessment, 11, 415-425. 

Krosnick, J. A. (1991). Response strategies for coping with the cognitive demands of attitude measures 
in surveys. Applied Cognitive Psychology, 5, 213-236. 

Latner, J. D., & Wilson, G. T. (2002). Self-monitoring and the assessment of binge eating. Behavior 
Therapy, 33, 465-477. 

Litt, M. D., Cooney, N. L., & Morse, P. M. (1998). Ecological momentary assessment (EMA) with 
alcoholics: Methodological problems and potential solutions. Health Psychology, 17, 48-52. 

Marco, C. A., Schwartz, J. E., Neale, J. M., Shiffman, S., & Stone, A. A. (1999). Do appraisals of daily 
problems and how they are coped with moderate mood in everyday life? Journal of Consulting and 
Clinical Psychology, 67, 755-764. 

McAuliffe, T. L., DiFranceisco, W., & Reed, B. R. (2007). Effects of question format and collection 
mode on the accuracy of retrospective surveys of health risk behavior: A comparison with daily 
sexual activity diaries. Health Psychology, 26(1), 60-67. 

McFall, R. M. (1970). Effects of self-monitoring on normal smoking behavior. Journal of Consulting 
and Clinical Psychology, 35, 135-142. 

Merrilees, C. E., Goeke-Morey, M., & Cummings, E. M. (2008). Do event-contingent diaries about 
marital conflict change marital interactions? Behaviour Research and Therapy, 46, 253-262. 

Mills, M. E., Murray, L. J., Johnston, B. T., & Donnelly, M. (2008). Feasibility of a standardized qual- 
ity of life questionnaire in a weekly diary format for inoperable lung cancer patients. European 
Journal of Oncology Nursing, 12, 457-463. 

Nelson, R. O. (1977). Methodological issues in assessment via self-monitoring. In J. D. Cone & R. 
P. Hawkins (Eds.), Behavioral assessment: New directions in clinical psychology (pp. 217-240). 
New York: Brunner/Mazel. 

Nelson, R. O., & Hayes, S. C. (1979). The nature of behavioral assessment: A commentary. Journal of 
Applied Behavior Analysis, 12, 491-500. 

Paulhus, D. L. (1991). Measurement and control of response bias. In J. P. Robinson, P. R. Shaver, & L. 
S. Wrightsman (Eds.), Measures of personality and social psychological attitudes (pp. 17-59). San 
Diego, CA: Academic Press. 

Peters, M. L., Sorbi, M. J., Kruise, D. A., Kerssens, J. J., Verhaak, P. F. M., & Bensing, J. M. (2000). 
Electronic diary assessment of pain, disability and psychological adaptation in patients differing 
in duration of pain. Pain, 84, 181-192. 

Petty, R. E., & Cacioppo, J. T. (1986). Communication and persuasion: Central and peripheral routes 
to attitude change. New York: Springer-Verlag. 

Reis, H. T., & Gable, S. L. (2000). Event-sampling and other methods for studying everyday experi- 
ences. In H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality 
psychology (pp. 190-222). Cambridge, UK: Cambridge University Press. 

Roberts, C., & Stark, P. (2008). Readiness for self-directed change in professional behaviours: Factorial 
validation of the Self-Reflection and Insight Scale. Medical Education, 42, 1054-1063. 

Rowan, P. J., Cofta-Woerpel, L., Mazas, C. A., Vidrine, J. I., Reitzel, P. M., Cinciripini, P. M., et 
al. (2007). Evaluating reactivity to ecological momentary assessment during smoking cessation. 
Experimental and Clinical Psychopharmacology, 15, 382-389. 

Rozensky, R. H. (1974). The effect of timing of self-monitoring behavior on reducing cigarette con- 
sumption. Journal of Behavior Therapy and Experimental Psychiatry, 5, 301-303. 

Schwartz, C., Sprangers, M., Carey, A., & Reed, G. (2004). Exploring response shift in longitudinal 
data. Psychology and Health, 19, 51-69. 

Shiffman, S. (2009). Ecological momentary assessment (EMA) in studies of substance use. Psychologi- 
cal Assessment, 21, 486-497. 

Silvia, P. J. (2002). Self-awareness and emotional intensity. Cognition and Emotion, 16, 195-216. 

Simpson, T. L., Kivlahan, D. R., Bush, K. R., & McFall, M. E. (2005). Telephone self-monitoring 
among alcohol use disorder patients in early recovery: A randomized study of feasibility and mea- 
surement reactivity. Drug and Alcohol Dependence, 79, 241-250. 

Smith, A. F. (1999). Concerning the suitability of recordkeeping for validating and generalizing about 
reports of health-related information. Review of General Psychology, 3, 133-150. 

Sonnenschein, M., Sorbi, M. J., van Doornen, L. J. P., & Maas, C. J. M. (2006). Feasibility of an elec- 
tronic diary in clinical burnout. International Journal of Behavioral Medicine, 13, 315-319. 

Stone, A. A., Broderick, J. E., Schwartz, J. E., Shiffman, S., Litcher-Kelly, L., & Calvanese, P. (2003). 
Intensive momentary reporting of pain with an electronic diary: Reactivity, compliance, and 
patient satisfaction. Pain, 104, 343-351. 

Tang, T. Z., DeRubeis, R. J., Hollon, S. D., Amsterdam, J., & Shelton, R. (2007). Sudden gains in cogni- 
tive therapy of depression and depression relapse/recurrence. Journal of Consulting and Clinical 
Psychology, 75(3), 404-408. 

Tennen, H., & Affleck, G. (1996). Daily processes in coping with chronic pain: Methods and analytic 
strategies. In M. Zeidner & N. S. Endler (Eds.), Handbook of coping (pp. 151-180). New York: 
Wiley. 

Webb, E. J., Campbell, D. T., Schwartz, R. D., & Sechrest, L. (1966). Unobtrusive measures: Nonreac- 
tive research in the social sciences. New York: Rand McNally. 

Weinhardt, L. S., & Carey, M. P. (2000). Does alcohol lead to sexual risk behavior?: Findings from 
event-level research. In J. R. Heiman, C. M. Davis, & S. L. Davis (Eds.), Annual review of sex 
research (Vol. 11, pp. 125-167). Allentown, PA: Society for the Scientific Study of Sexuality. 

Wing, R. R., Carrol, C., & Jeffery, R. W. (1978). Repeated observation of obese and normal subjects 
eating in the natural environment. Addictive Behaviors, 3, 191-196. 

Yan, T., & Tourangeau, R. (2008). Fast times and easy questions: The effects of age, experience, and 
question complexity on Web survey response times. Applied Cognitive Psychology, 22, 51-68. 


CHAPTER 7 


Computerized Sampling 
of Experiences and Behavior 


THOMAS KUBIAK 
KATHARINA KROG 


Conducting experience sampling studies has been made substantially easier over 
the last two decades with readily available computerized solutions. Computerized 
approaches to sample experiences and behaviors in situ have seen a sharp increase. Two 
of the main reasons for this rise in computerized sampling most probably lie in the falling 
costs of suitable mobile devices over the years and the availability of software packages to 
implement experience sampling studies. In the early 2000s, researchers had to do most of 
the programming themselves because dedicated applications were lacking (e.g., Feldman 
Barrett & Barrett, 2005; Fahrenberg, Hüttner, & Leonhardt, 2001). Today, both free 
and commercial software solutions are readily available and easy to use, allowing users 
quickly to design and implement their study protocol. 

Besides the increase in the number of available software packages, one can identify 
at least three trends over the past years. First, experience sampling software is much more 
user-friendly than it used to be, with script-like batch programming dropped 
in favor of integrated software solutions. Second, software now focuses more often on the 
integration with monitoring data from other sources, including physiological or activity 
data, or data on concurrent environmental processes, such as ambient noise or temperature. 
Self-report diary data may then be easily mapped onto other processes. Finally, 
given the capabilities of modern mobile devices, current software solutions focus more 
and more on a seamless integration with (Inter)net-based services using Wi-Fi-enabled 
mobile devices or smartphones. 

Our aim in this chapter is to offer the reader advice about which criteria to consider 
before choosing a particular software package for experience sampling and a current 
overview of available solutions. Unfortunately, in many published articles on experience 
sampling studies, information on the software and devices used is missing, so orientation 
is sometimes difficult for researchers new to the field. Moreover, we want to give some 
practical advice from our own experience in the field when it comes to implementing 
computerized experience sampling. 

The chapter is divided into four parts. First, we give a short overview of currently 
available platforms. We also dare to venture some—necessarily subjective—predictions 
as to which platform is the safest choice for a researcher who seriously wants to engage 
in computerized experience sampling in the years to come. Second, we will briefly outline 
criteria one should consider before choosing a particular software solution. In our view, 
the criteria for software choice should be based largely on the features one needs for a 
given study. Third, we give an overview of currently available software solutions, with 
a particular focus on open-source software. Commercial software is included as well. 
Finally, we give some helpful hints for implementing computerized experience sampling 
using a given software solution and platform. We conclude this chapter with some final 
remarks on future trends and possibilities that are to be expected in the field. 

We have tried hard to be comprehensive. However, we included only software we 
were able to test thoroughly ourselves. Not all software companies replied to our request 
for a demonstration version of their software. Naturally, despite our efforts to identify 
suitable software for experience sampling, we may simply have missed some relevant 
products. If so, we would be pleased if readers would contact us via e-mail. 

In this chapter we focus on dedicated software solutions for mobile data acquisition, 
enabling researchers to design and implement their computerized experience sampling 
studies themselves. We do not address solutions based on text messaging that uses par- 
ticipants’ mobile phones, which for some applications may be a useful and less costly 
alternative compared to use of dedicated devices (see Conner & Lehman, Chapter 5, this 
volume). Neither do we cover full-service companies that provide “all-in-one” solutions, 
including study design, mobile device programming, and analysis of the results (e.g., 
invivodata, www.invivodata.com). 


Platforms for Computerized Experience Sampling 


Concerning the mobile hardware, reduced costs are only one side effect of recent develop- 
ments. The rapidly evolving field of mobile computing sees new technology, devices, and 
operating systems emerging in ever-shorter cycles. For researchers interested in comput- 
erized experience sampling, difficult decisions arise: Which software and which mobile 
device with which operating system should I use? These questions cannot be answered 
easily, as illustrated by some examples. In a previous review on available software solu- 
tions, Ebner-Priemer and Kubiak (2007) highlighted software for the then-common Palm 
OS and Windows Mobile/Pocket PC platforms. Since then, Palm OS has been dropped 
by Palm Inc., which in turn was recently acquired by Hewlett Packard. A large variety 
of software solutions building on Palm OS are still used by many researchers, who now 
have to rely more and more on secondhand mobile devices, because no major company 
produces Palm OS-based devices anymore. Windows Mobile in turn faces a major ver- 
sion release soon, with backward compatibility being a potential issue if one wants to run 
software for experience sampling that worked fine with previous releases. In summary, 
longevity of operating systems (OSs) and compatibility issues are hardly predictable. For 
example, in our own lab, we used three different Palm models based on Palm OS, two 
different models with Windows Mobile, and one system still running Psion EPOC, with 
approximately a dozen devices for each model in the past years. 

At the time of writing this chapter, seven operating systems for mobile devices are in 
widespread use (in alphabetical order): 


e Android: Android, a dedicated operating system developed for smartphones by 
the Open Handset Alliance consortium, includes software and hardware companies, as 
well as telecommunications providers. The distributing company, also named Android, 
is owned by Google. Basically, Android is built on a Linux platform and differs from its 
competitors, specifically the iOS, in that the OS’s code is open to the public and handled 
less restrictively. Android runs on smartphones (e.g., HTC, Samsung, and LG). 


e BlackBerry OS: BlackBerry OS, formerly known as BlackBerry RIM, is a propri- 
etary but free platform designed to drive BlackBerry’s mobile devices. We know of no 
stand-alone client software for experience sampling that uses this OS, but several server- 
based solutions are compatible with it. 


e iOS: This version of Mac OS (i.e., essentially a UNIX flavor) is the operating sys- 
tem of the iPhone and the iPod touch series, and the iPad. Both the iPod touch and the 
iPad are compatible with the iPhone but lack phone functions. Hence, particularly the 
iPod touch could be a lower cost hardware if one wants to run iOS software for experi- 
ence sampling. 


• Linux versions can be found for virtually any mobile computing hardware. This
holds for Palm devices, as well as devices originally designed for Windows Mobile.
Moreover, Linux runs successfully on “exotic” hardware such as the Sony PlayStation
Portable or the Nintendo DS, which in principle opens up a vast range of opportunities
given the power of gaming devices (e.g., to display high-resolution graphics or to
register reaction times accurately, not to mention increasing compliance when studying
adolescents). However, for nearly all mobile architectures Linux is still at the beta
stage, albeit rapidly evolving. Compared to “out of the box” solutions, one has to be
prepared to put considerable programming skills and effort into customizing software to
the specifics of these architectures. Nonetheless, Linux has many strengths and could be
a promising candidate for future developments in computerized experience sampling given
its open-source license and cross-platform availability.


• Palm Pre/webOS is the successor of the classic Palm OS and is currently used on
Palm smartphones only. Unfortunately, it lacks compatibility with Palm OS applications,
so well-known experience sampling software (e.g., ESP) will not work, although a “classic”
environment software add-on has been announced to add Palm OS compatibility. The
future of the operating system, however, is uncertain, as the company was recently
acquired by Hewlett Packard.


• Symbian OS, a platform for mobile phones and smartphones, is an open-source
project hosted by Nokia. Symbian OS runs on a variety of mobile phones (e.g., Nokia or
Ericsson). Historically, Symbian OS evolved out of Psion’s EPOC (see below).


• Windows Mobile. This Microsoft Windows derivative (formerly known as Pocket
PC or Windows CE) runs on a large number of mobile devices, including the iPaq series
by Hewlett Packard.


As this overview of currently used platforms indicates, there is a clear trend toward
smartphone OSs and, hence, smartphones as mobile clients. In our view, it is only a
matter of time before stand-alone handheld devices such as the classic Palm models
vanish, given the current rise in the smartphone sector. In the near future, this
may mean that smartphones have to be used for research even when smartphone
functionalities are not necessary to realize a given research project.

Another possibility is to rely on used devices: Sophisticated experience sampling
software is available for other, well-established operating systems that are no longer
used on any of the current devices. These include, foremost, Palm OS, for which a variety
of well-known and well-established software exists. The same holds for EPOC by Psion,
which went out of the handheld business in the late 1990s. It will be increasingly hard to
get one’s hands on the devices needed to run these OSs. It is, for instance, virtually
impossible to buy new Psion handheld computers. Nonetheless, we decided to include Palm
OS- and EPOC-based experience sampling software in the next section. In our view this
software is clearly still worth a try, if one is not bothered by the effort of looking for
used devices.


What to Consider before 
Choosing an Experience Sampling Software Package 


One decides what processes one wants to study, then chooses the software and platform. 
There is no single best solution. Before opting for one software solution or another it is
necessary to consider at least three issues: (1) the capabilities of the software needed to 
implement a given study, (2) the population of interest that one intends to study, and (3) 
the degree to which one plans to link the electronic diary assessment with other sources 
of data (e.g., physiological recordings, geospatial information). 

Most important, one chooses experience sampling software on the basis of the fea- 
tures needed for a particular study. Key features to consider are, for example, the soft- 
ware’s capabilities for implementing different sampling schemes (see Conner & Lehman, 
Chapter 5, and Moskowitz & Sadikaj, Chapter 9, this volume) and the available options 
for item formats, which can range from Likert-type rating scales to visual analog scales 
or sliders. One should also consider the need for special features, such as capabilities to
include not only self-reports but also in-field (cognitive) performance testing, to provide
real-time data analysis options (e.g., for feedback purposes), or to implement adaptive
sampling schemes (e.g., a change in sampling frequency depending on the participant’s
answers).
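
To make the idea of an adaptive sampling scheme concrete, the following Python sketch shortens the interval to the next prompt whenever a participant's rating reaches a threshold. It is only an illustration: the function name, the rating threshold, and the two interval lengths are hypothetical choices, and none of the software packages reviewed in this chapter is implied to work in exactly this way.

```python
from datetime import datetime, timedelta


def next_prompt_time(now, rating, threshold=7,
                     regular_minutes=120, intensified_minutes=30):
    """Return the time of the next prompt.

    Hypothetical rule: a rating at or above `threshold` switches
    to a denser sampling interval; otherwise the regular interval applies.
    """
    minutes = intensified_minutes if rating >= threshold else regular_minutes
    return now + timedelta(minutes=minutes)


# Example: the same prompt time, followed by a low and a high rating.
now = datetime(2011, 5, 2, 10, 0)
print(next_prompt_time(now, rating=3))  # regular interval applies
print(next_prompt_time(now, rating=9))  # intensified interval applies
```

In a real study the rule would be defined by the research question (e.g., denser sampling after reports of high stress) and implemented with whatever control flow features the chosen software provides.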

Next, one considers which population one wants to study. Some software may be 
more suitable for specific populations. For instance, for older adult participants or par- 
ticipants with reduced psychomotor capabilities, using a platform and software with 
stylus-type input may be problematic, whereas using platforms with keyboard support or 
a touch screen that can be operated with one’s fingers may be the better option. In some
populations (e.g., in industrial/organizational [I/O] settings) the intrusiveness of typing
into a handheld computer may be an issue, and voice-triggered events and/or voice recording
could be a useful option. Finally, special features, like external triggering by
physiological signals, should be considered.

When choosing a particular device, keep some caveats in mind. Some of the features
implemented in the software are device-dependent. This holds, for example, for
signaling options or the impact of a given device’s display size on the available options
for presenting items. Moreover, even if software has been developed and well tested on
a given platform, it may not run smoothly on all mobile devices that run the respective
OS. This is particularly true for Windows Mobile-based devices, which are highly diverse.
Technical information on device compatibility and listings of tested and supported
devices are available for most of the software solutions and should be consulted.

We want to conclude this section with some practical remarks concerning the costs 
of device acquisition. For stand-alone handheld computers, prices have dropped sharply 
in recent years, ranging from about $100 to $300 for a single device at the time of this 
writing. Most companies and resellers offer an additional, quite substantial quantity
discount if approximately 50 or more devices are purchased. If one opts for a smartphone
solution, the situation is more difficult. Unlocked smartphones without a
telecommunications provider contract, which are all one needs if one does not want to use
any phone features, are still expensive. Smartphones with a contract are subsidized and
less expensive but usually incur monthly costs. Please note that some major
telecommunications providers may offer discounts for smartphones with contracts if one
contacts the corporate customer service department. However, one has to plan far into the
future because the contract period is usually 1 year or more.


Software Packages 


In this section we review software packages for experience and behavior sampling. Please 
note that we opted to restrict our overview to software that the developers/companies 
made available to us, so that we could test and review it thoroughly. We describe free 
software first, then turn to commercial software solutions. In Table 7.1 we give an over- 
view of key features of the software packages. Additional information, such as contact 
details and license policies at the time of this writing, is given in Table 7.2. Key features 
that we use to describe the software are as follows: 


1. Platform, that is, the OS of the mobile device and the OS for the desktop-based 
application to program the mobile clients (if applicable). 

2. Supported item formats; in addition to standard item formats, we consider some 
multimedia capabilities to fall within this category, such as the recording of 
pictures, sound, or video, or the triggering of events and data entry via spoken 
commands. Voice recording can be particularly useful if entry of large amounts 
of text is planned. 

3. The availability of cognitive performance testing implemented on the mobile 
device. 

4. Control flow features, such as conditional branching or loops. 

5. The possibility to trigger data entry by an external signal (e.g., a physiological 
signal). 

6. Capability for real-time processing of entered data, which is, for example, a 
prerequisite for a wide range of applications in which the participants receive 
immediate feedback via the mobile device. 

7. The programming environment, with broadly two categories: Batch script-like 
programming or dedicated applications (integrated development environments 
[IDEs]).

8. Supported sampling schemes, including signal-contingent and event-contingent
schemes, a combination of both (mixed schemes), and individualized schemes 
(e.g., conditional real-time modification of a signal-contingent sampling scheme 
depending on the data entered by the participant). 

9. Signaling options (acoustic signals, vibration). 

10. Output formats, which range from comma-separated values (CSVs) and plain 
text (both possibly the most useful formats) to proprietary formats of standard 
statistical software packages. 

11. Special features. Among the most important special features to consider are
geotagging via global positioning system (GPS), integration with other monitoring
data (e.g., physiological recording that triggers self-report prompts), and integration
with Internet-based services for data transmission (and feedback). The latter
may include transmission via short messaging services, or via the Internet or a
LAN using UMTS/EDGE/GPRS technology or Wi-Fi access.
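
As an illustration of what a signal-contingent sampling scheme (feature 8) involves, the Python sketch below generates one day of random prompt times. It uses one common approach, stratified random sampling, in which the waking day is divided into equal blocks and one random time is drawn within each block. The function name, the day boundaries, and the number of signals are hypothetical and chosen only for this example; the reviewed software packages may implement their schedules differently.

```python
import random
from datetime import datetime


def stratified_random_schedule(day_start, day_end, n_signals, rng=None):
    """Draw one random prompt time within each of n_signals equal blocks."""
    rng = rng or random.Random()
    block = (day_end - day_start) / n_signals  # length of one block
    return [day_start + i * block + block * rng.random()
            for i in range(n_signals)]


# Example: six prompts between 9:00 and 21:00 (assumed waking hours).
start = datetime(2011, 5, 2, 9, 0)
end = datetime(2011, 5, 2, 21, 0)
for prompt in stratified_random_schedule(start, end, 6, random.Random(42)):
    print(prompt.strftime("%H:%M"))
```

Drawing within blocks, rather than across the whole day, keeps prompts from clustering while leaving their exact timing unpredictable for the participant.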


Free Software 


Fortunately, researchers nowadays can also choose from a wide range of software that is 
offered free of charge, thanks foremost to the pioneering work of researchers who devel- 
oped their own solutions, such as ESP and MONITOR, to overcome the lack of suitable 
software. Free software usually comes with different licensing policies (e.g., GNU Gen- 
eral Public License; Free Software Foundation, 2007; see www.opensource.org for an 
overview). The main difference that could be of interest to researchers here is whether the 
software is “true” open source, that is, that the software’s source code is available, and 
whether one is allowed to change and adapt the code for one’s own purposes. Please note 
that the more complex and comprehensive free software solutions, such as the MyExperi- 
ence project, do not come as an out-of-the-box, ready-to-go product and usually require 
at least some programming skills to run them properly. We identified the following free 
software solutions that can be used for computerized experience sampling (in alphabeti- 
cal order). 


EPICOLLECT 


EpiCollect is not a classic client-based experience sampling software, but a server-based 
solution that relies on Google’s App service that is compatible with Android- and iOS- 
based smartphones. It was developed at the Imperial College, London, mainly for con- 
ducting community-based public health surveys (e.g., Aanensen, Huntley, Feil, al-Own, 
& Spratt, 2009). 

EpiCollect is free of charge, but (free) registration with Google is necessary to use the
service. The software does not offer much with regard to data entry formats: Only text
entry and two types of (multiple) choice lists are offered. Sampling schemes cannot be
implemented; participants enter data whenever they want (or are instructed to do so).
However, some of the features are unique and can be of interest for particular applications.
First, the mobile clients transmit data, including geospatial data (geotagging) and, if
desired, photos online to the server. Second, these data can be accessed anytime and
downloaded in tabular form or mapped using the Google Earth engine. Third, participants
may be contacted anytime using the built-in chat function.


EXPERIENCE SAMPLING PROGRAM


The ESP package (Feldman Barrett & Barrett, 2001, 2005) has been around since the late 
1990s and is still in widespread use. Developed by Daniel J. Barrett and Lisa Feldman 
Barrett, ESP offers basic capabilities to implement “classic” experience sampling studies, 
that is, a signal-contingent assessment at random time points to get a comprehensive and 
representative overview of the daily life self-report measures of interest (see Conner & 
Lehman, Chapter 5, this volume). ESP is rather limited with regard to the choice of sam- 
pling schemes (no flexible, freely programmable, fixed time signals; no event-contingent 
sampling; no branching of questions). However, the software is very stable and a good 
choice if a study does not require sophisticated sampling schemes. ESP runs on Palm 
OS-based devices only. The software has been used in numerous studies (e.g., Tugade, 
Fredrickson, & Feldman Barrett, 2004). 


MONITOR 


The software package MONITOR was developed by the Psychophysiology Research 
Unit of the University of Freiburg, Germany (Fahrenberg et al., 2001). MONITOR runs 
on EPOC-based Psion handheld computers (Series 3a and above), which are not available 
for sale anymore. However, some of the software features are unique, and the software 
and the Psion devices themselves are very robust, so researchers may opt to locate used 
devices. MONITOR was among the first software to support external triggering, for 
example, by a physiological signal. It also offers a range of three cognitive tests, a fea- 
ture seldom found in more current software. Moreover, the Psion series mobile devices 
are equipped with a full keyboard. While this makes the device bulkier than current
models, text entry is substantially easier and may meet the needs of special study
populations, such as older adults. The software comes with a comprehensive manual
in German only. MONITOR’s fully documented source code was made available by the
authors a couple of years ago and may offer future prospects for this software. However,
we are unaware of any attempts to develop MONITOR further and port it onto current 
flavors of EPOC (e.g., Symbian OS). Examples of published studies in which MONITOR 
was successfully used are Fahrenberg, Bruegner, Foerster, and Kaeppler (1999), Ebner- 
Priemer and Sawitzki (2007), and Kubiak and Jonas (2007). 


MYEXPERIENCE 


The MyExperience project initiated by Jon Froehlich of the University of Washington is 
a community-driven project that builds on the input of its contributors. One of the most 
interesting features of MyExperience is its modular design and connectivity (Froehlich, 
Chen, Consolvo, Harrison, & Landay, 2007). The software runs on Windows Mobile 
devices and builds on Microsoft’s .NET architecture. MyExperience was strongly inspired 
by Stephen Intille’s pioneering work on context-aware experience sampling (Intille, 2007; 
Intille, Rondoni, Kukla, Anacona, & Bao, 2003); the software is characterized by an 
open, expandable architecture that allows the experience sampling component to be 
integrated with any other source of data, ranging from ambient signals and geospatial 
information to physiological data. Setting up and installing the software, and particularly
programming a study, can be a tedious chore for someone not into (hobby) programming:
Programming MyExperience is still done by writing code (a commercial derivative of the
software exists, though, providing an IDE that makes programming much easier). Because
MyExperience runs on Windows Mobile, we strongly advise consulting the developers’
compatibility list and testing any given mobile device for compatibility issues. Two
particular strengths of the project lie in the support of the developers and the appeal of
developing the software further. The MyExperience project is about to leave the beta stage
(at the time of this writing the current version is 0.9.1).


OPENXDATA 


Like MyExperience, openXdata is a community-driven, open-source project (with additional
paid support available upon request). One of the key aims of the openXdata project
is to develop a software platform for data collection in international vaccine trials. It is
embedded within the Open Mobile Electronic Vaccines (OMEVACS) project. The data 
collection software has evolved—in part—out of the Epihandy software, which is not 
being developed further but is still available for download (www.epihandy.com). Project 
partners are clinical and research institutions, and include the University of Bergen, the 
World Health Organization (WHO), and the Promise Consortium. Whereas MyExperi- 
ence focuses on its open architecture and the integration of other sources of data and 
sensor systems, openXdata focuses on its suitability for use within clinical trial systems, 
on capabilities of online data transmission from the mobile clients, and on multilingual 
support. While these features are not essential for many areas of computerized experience 
sampling, the software still provides a vast range of features that are well suited in this 
regard. Setting up openXdata requires some programming skills even though there is a 
browser-based user interface available. Like MyExperience, the software is undergoing 
rapid development and changes. 


PURDUE MOMENTARY ASSESSMENT TOOL 


The PMAT software developed by Howard Weiss and colleagues at the Purdue Military 
Family Research Institute runs on Palm OS mobile devices (Weiss, Beal, Lucy, & Mac- 
Dermid, 2004). PMAT comes with a simple and easy-to-use Java-based programming 
environment. Various control flow features and sampling schemes are supported. The 
latter include signal-contingent and event-contingent schemes, as well as mixed schemes 
in which signal-contingent and event-contingent sampling may be combined. Examples 
of studies that have used PMAT are Beal, Trougakos, Weiss, and Green (2006) and Trou- 
gakos, Beal, Green, and Weiss (2008). 


Commercial Software 


Commercial experience sampling software comes with diverse licensing schemes that 
range from a one-time fee to rental of the software for a given period of time, to perpetual 
licensing (see Table 7.2). Moreover, several companies charge for each participant that 
is studied or for each mobile device in use. Some software is available free of charge for 
academic use, though. Most of the suitable software in the commercial sector was not
developed specifically for daily process research, but as a tool for marketing or
public health surveys, so be aware that some terminology used to describe features in the
software’s product information and manual differs from the terminology used in experience
sampling (e.g., notification features that refer to sampling schemes). Moreover, some
features may be less useful for most experience sampling purposes (e.g., the
capability to read bar codes). The following software packages were made available to us
for testing and reviewing purposes (in alphabetical order).


ENTRYWARE


The main field of application of Entryware (Techneos, Inc., Vancouver, Canada) is mar- 
keting research. Entryware runs on Palm OS and Windows Mobile. The software is 
highly customizable in terms of item formats, sampling schemes, and, most notably, con- 
trol flow features: The software, for instance, allows for dynamic changes in item word- 
ings depending on participants’ answers. An example of the use of Entryware for experi- 
ence sampling is the study by Bernhardt and colleagues (2009), in which the software was 
used to assess self-reports of alcohol intake. The company has recently released another
software package for the sampling of self-reports called SODA, specifically developed for
current smartphone devices running iOS, Android, or BlackBerry OS. SODA is described
later in this section.


IDIALOGPAD 


The iDialogPad software developed by Gerhard Mutz of Cologne University (Germany) is 
iOS-based and, hence, only runs on Apple mobile devices. Rudimentary features for cogni- 
tive testing are available, and GPS is supported (if available for a given device). Compared 
to other recent software, the programming is done in an “old school” way using scripts fed 
into the mobile clients. The iDialogPad software is currently being used in Germany in a 
large, multicenter trial on computer-assisted treatment of panic disorder. 


IFORMBUILDER 


This software, developed by Zerion Software, Inc. (Herndon, VA), builds on a server- 
based architecture and is compatible with smartphones running iOS (and, in the near 
future, Android). The iOS mobile application is offered free of charge and can be
downloaded via Apple’s App Store. iFormBuilder supports basic item formats and conditional
branching. Support for sampling schemes is limited to real-time notifications that may be
sent to the participant.


IZYBUILDER 


IzyBuilder is a commercial software package developed by IzyData (Fribourg, Switzer- 
land). Although it is a commercial product, the software comes with special licensing 
options for academia. One of the strengths of IzyBuilder lies in its intuitive development 
environment, where via drag-and-drop the user can very easily adapt the way items and 
texts are presented. IzyBuilder offers by far the most freedom compared to other software 
in terms of custom item presentations. Moreover, an emulator is included, allowing the 
user to test programs on a desktop computer. While previous versions of IzyBuilder were
available for both Windows Mobile and Palm OS platforms, the developers have
discontinued the Palm OS branch of the product and now focus on Windows Mobile only.
Whereas previous versions of the software were at times notorious for stability
issues, the current version seems to have been improved considerably in this regard.
Published research that has used IzyBuilder includes Wilhelm and Schoebi (2007),
Reicherts, Salamin, Maggiori, and Pauls (2007), and Siewert, Antoniw, Kubiak, and Weber (in
press).


MOBILECAT 


This software was developed by Matthias Rose, Jakob Bjorner, and John Ware, supported 
by a grant from the National Institute of Mental Health. The software’s primary objec- 
tive is to provide a platform for computerized adaptive testing, particularly the assess- 
ment of patient-reported outcomes within an item response theory (IRT) framework (e.g., 
Bruce et al., 2009). The software runs on all currently available iOS clients (iPhone, iPod 
touch, iPad) and is offered free of charge for academic use. The software is still in the 
beta stage (current version 0.9). Basic item formats are implemented, as are control flow 
features such as conditional loops and—important for the intended IRT application— 
stopping rules. Sampling schemes and signaling are not supported per se, but real-time 
notifications may be used to alert participants who use the iPhone. 


MQUEST 


mQuest, developed by cluetec GmbH (Karlsruhe, Germany), was originally intended
for use in both marketing surveys and research. mQuest offers many possibilities for
presenting items, as well as options to capture sounds (e.g., to record spoken commentar- 
ies) or to take photos that are stored on the mobile device. mQuest comes with an easy- 
to-use desktop application to program question sets and to retrieve data. However, the 
program lacks flexibility with regard to implementing and combining different sampling 
schemes and does not include time stamping of data entry by default. Adapted versions 
that resolve these issues are available upon request (see Kubiak, Jonas, & Weber, 2011). 
The software runs on Windows Mobile devices. 


MYEXPERIENCE MOVISENS EDITION 


This software is based on the free MyExperience software described earlier and is devel- 
oped by movisens GmbH (Karlsruhe, Germany). It adds several new features that include, 
foremost, capabilities for cognitive testing and an integrated development environment 
for programming. Moreover, the movisens developers tried to build on the open archi- 
tecture of the software with a particular focus on the integration of other signals (e.g., 
activity) to foster its use as a part of an integrated multichannel system. MyExperience 
movisens Edition has not been released to the public yet. Thus, pricing information can- 
not be given. 


PENDRAGON FORMS 


Pendragon Forms by Pendragon Software (Libertyville, IL) is more a software development
and widget toolkit than a dedicated software package for experience sampling. The
software offers a programming environment, a rich set of widgets for data entry, and
rudimentary control flow features. At the time of this writing, two versions are
available: Version 5.1 of Pendragon Forms builds on Windows Mobile and Palm OS, whereas
the current Version 6 runs on iOS and Android only, using a server/smartphone client
architecture. Pendragon Forms is one of the most flexible solutions when it comes to
questionnaire layout and item formatting; hence, it is often used in computerized
experience sampling research. However, as of Version 5.1, implementing sampling schemes can
be difficult, because signals have to be programmed within the device’s generic calendar,
which then triggers the data entry.


SODA 


SODA, Techneos’s new electronic diary solution, is for mobile clients that use
BlackBerry OS, Android, or iOS. It is similar to and as powerful as Entryware in terms of
customization and adds several new features, such as geotagging. SODA is essentially a
server-based/smartphone client solution, with an architecture similar to iFormBuilder,
Pendragon Forms VI, or the EpiCollect project.


Helpful Hints for Implementing 
Computerized Experience Sampling 


After software and a platform are chosen, one then implements one’s study. Generally, 
this task is not too difficult with current user-friendly software. However, some pitfalls 
have to be kept in mind. In this section we give some hints from our own experience for 
implementing computerized experience sampling studies. 


Diary Design 


Naturally, general principles of diary design also hold for computerized experience sam- 
pling (see Conner & Lehman, Chapter 5, this volume). One particular strength of comput- 
erized sampling is the availability of control flow features, such as conditional branching 
between questions. Given the ease with which these can be implemented, many research- 
ers new to the field tend to overuse them. This may lead to overly complex protocols that 
possibly hamper compliance and data integrity (see Barta et al., Chapter 6, this volume). 
Use the advanced features only where appropriate.
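
A single conditional branch of the kind discussed here can be sketched in a few lines of Python. The item names, item wordings, and the dictionary layout below are hypothetical and serve only to illustrate the principle; actual experience sampling software uses its own scripting language or development environment for this purpose.

```python
# Hypothetical diary protocol: each item names its follow-up item,
# which may depend on the answer that was given.
PROTOCOL = {
    "stress_event": {
        "text": "Did a stressful event occur since the last prompt? (yes/no)",
        "next": lambda answer: "stress_intensity" if answer == "yes" else "mood",
    },
    "stress_intensity": {
        "text": "How stressful was the event? (1-7)",
        "next": lambda answer: "mood",
    },
    "mood": {
        "text": "How do you feel right now? (1-7)",
        "next": lambda answer: None,  # None marks the end of the diary entry
    },
}


def run_protocol(answers, first_item="stress_event"):
    """Walk through the protocol with pre-recorded answers.

    Returns the list of items that would have been presented.
    """
    presented = []
    item = first_item
    while item is not None:
        presented.append(item)
        item = PROTOCOL[item]["next"](answers[item])
    return presented


print(run_protocol({"stress_event": "no", "mood": "5"}))
print(run_protocol({"stress_event": "yes", "stress_intensity": "6", "mood": "3"}))
```

The first call presents only the stress and mood items, whereas the second also presents the intensity item. Even this small example shows how quickly the number of possible paths grows with each additional branch, which is one reason to use branching sparingly.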


Increasing Commitment 


Equipment for experience sampling is still expensive and, naturally, researchers want 
participants to handle the devices with care. From a legal perspective, participants are 
not liable for any damage to the devices while they take part in a study (a participant 
may be held liable only if he or she does not return the device at all and simply keeps it). 
However, in our experience, it is helpful to hand out to participants an information sheet 
on how to handle the device properly and have them sign a statement of commitment, 
which increases compliance. 


Liability for Inadequate Use 


If one uses an Internet-enabled device and does not lock out the device’s regular
functions, such as the Web browser, one should be sure to have participants sign a form
indicating that they are liable for what they do on the Internet (visiting websites,
downloading music, etc.).


Technical Failure 


Compared to former generations of mobile devices, current models are robust. However, 
technical failure or loss may still occur. In our own studies, we lost between 0 and 5% of 
the devices in each study, with reasons for loss ranging from broken displays to a hand- 
held device accidentally flushed down the toilet. To ensure that a study runs smoothly it 
is important to have some spare backup devices available. 


Software Failure 


Even well-proven software may crash at times, possibly leading to data loss. In addi- 
tion to thoroughly testing one’s sampling protocol before starting the study, two further 
measures are helpful. First, provide participants with a flowchart-type information sheet 
about what to do in case of software failure. Some experience sampling software can be 
easily restarted, without any data loss, by the participants themselves. Second, establish 
a 24-hour “technical hotline” (e.g., using a prepaid mobile phone carried by oneself and/ 
or one’s staff in shifts). 

In our experience, software problems often arise when the software is run in “lock- 
out” mode, in which all other functions of the mobile device are locked to prevent partici- 
pants from using other built-in features. Even if instructed otherwise, some participants 
may perceive this as a personal challenge and try to circumvent the “lockout” (which, in
fact, is possible in some way for all software described in this chapter), which is likely
to crash the experience sampling software. One possible alternative, with which we have had
good experiences in studies with adolescents, is to refrain from using the “lockout” mode
from the beginning. Although this is suboptimal in terms of protocol standardization, we
observed a tremendous increase in protocol compliance, and no software failures at all,
when participants were allowed to use regular features, such as listening to an MP3 media
player.


Regular Backup 


For solutions that do not transmit data automatically in real time to a server, it is crucial 
to back up the data regularly. We usually see our participants once a week to download 
data recorded so far. In addition, regular appointments with participants are also an easy 
way to address potential problems and to get immediate feedback. 
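
For researchers who download data manually, a small script can make such weekly backups less error-prone. The Python sketch below copies a folder of downloaded data files into a new, time-stamped folder for each participant, so that earlier backups are never overwritten. The folder layout, the participant code, and the function name are assumptions made for this example and are not tied to any of the software packages described in this chapter.

```python
import shutil
from datetime import datetime
from pathlib import Path


def back_up_data(source_dir, backup_root, participant_id, now=None):
    """Copy all files in source_dir to backup_root/<participant>/<timestamp>.

    A new time-stamped folder is created on every call, so previous
    backups are kept untouched. Returns the path of the new folder.
    """
    now = now or datetime.now()
    target = Path(backup_root) / participant_id / now.strftime("%Y%m%d_%H%M%S")
    shutil.copytree(source_dir, target)  # also creates missing parent folders
    return target


# Hypothetical usage after downloading a participant's data from the device:
# back_up_data("downloads/P001", "backups", "P001")
```

Keeping the backup folder on a different drive than the original download adds a further layer of protection against data loss.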


Final Remarks 


In this chapter we have provided an overview of computerized solutions for the sampling
of experiences and behavior. As we mentioned in the beginning, the field is rapidly
evolving, and by the time this volume is published, more software with new features will
likely have become available. To provide an up-to-date review of software solutions, an
accompanying website at the homepage of the Society for Ambulatory Assessment
(www.ambulatory-assessment.org) is regularly updated.

We conclude this chapter with some speculative thoughts and considerations concerning
future developments of computerized experience sampling. We have already mentioned
some trends that, in our view, will hold for the near future, including user-friendly
software, integration within multichannel systems, and (Inter)net-based services.

The latter point, connectivity, is where the most profound changes in methods are 
to be expected. This development is linked to a shift from dedicated devices (i.e., hand- 
held computers or personal digital assistants [PDAs]) to generic devices, such as Internet- 
enabled smartphones. While “first-generation” solutions for computerized experience 
sampling using stand-alone mobile devices have introduced valuable features such as 
exact time stamping of data entry, control flow features, or even in-field cognitive test- 
ing, the “second generation” of Internet-enabled smartphone devices offers a vast range 
of new opportunities, such as access to Web-based services, geotagging, real-time data 
acquisition and feedback, and audiovisual recording and streaming. From a technical,
though not a conceptual, perspective, the boundaries between experience sampling, Internet
surveys, and end-of-day diaries may blur (see Gunthert & Wenze, Chapter 8, this volume).
There are already projects that link computerized experience sampling to generic, net- 
based services provided by Google (e.g., EpiCollect) or Twitter (e.g., FlowingData, 2010; 
myflowingdata.com), or provide thematic experience sampling Applets for specific fields 
(e.g., Signal Patterns Labs: www.signalpatternslabs.com). 

We predicted earlier in this chapter that conventional stand-alone mobile devices
will vanish in the near future. This is mirrored by a similar development in
the software area, where commercial developers and even some free projects now pur- 
sue server-based/smartphone client solutions. At first glance, one might think that this 
development could jeopardize the future use of computerized experience sampling due 
to increased costs for smartphones and other mobile devices with Internet capabilities. 
We are, however, optimistic with regard to this development for three reasons. First, 
relatively low-cost hardware still exists, such as sophisticated multimedia players (e.g., 
the iPod Touch, for which experience sampling software is already available) or por- 
table gaming consoles (to which, unfortunately, experience sampling software has not 
yet been ported). Second, the cost for smartphones with and without a telecommunica- 
tions provider contract is dropping constantly. Third, and in our view most important, 
is the widespread use of smartphones in the general population. Currently, smartphone 
sales are steadily rising, with a parallel decline in conventional mobile phones (Gartner
Incorporated, 2010). An ad hoc survey of our undergraduate students (N = 117) showed 
that 89% of the students already owned an iOS- or Android-based smartphone. Other 
populations certainly differ in terms of smartphone use but, at least in student popula- 
tions, relying on a server-based/smartphone client architecture is an appealing option, 
particularly given the fact that access to mobile clients is usually free of charge for par- 
ticipants’ own smartphones. 

The future of computerized experience sampling lies in generic smartphone devices and
server-based/smartphone client architectures. The vast range of new features and applications is very
promising. Experience sampling research is just beginning to explore these new possibili- 
ties of “second-generation” computerized experience sampling. 


Authors’ Note 


Trademarks have been omitted throughout the chapter. The absence of trademarks does not imply that 
the names of the products described are unprotected. 




References 


Aanensen, D. M., Huntley, D. M., Feil, E. J., al-Own, F., & Spratt, B. G. (2009). EpiCollect: Linking 
smartphones to web applications for epidemiology, ecology and community data collection. PLoS 
ONE, 4(9), e6968. 

Beal, D. J., Trougakos, J. P., Weiss, H. M., & Green, S. G. (2006). Episodic processes in emotional 
labor: Perceptions of affective delivery and regulation strategies. Journal of Applied Psychology, 
91, 1053-1065. 

Bernhardt, J. M., Usdan, S., Mays, D., Martin, R., Cremeens, J., & Jacob Arriola, K. (2009). Alcohol 
assessment among college students using wireless mobile technology. Journal of Studies on Alco- 
hol and Drugs, 70, 771-775. 

Bruce, B., Fries, J. F., Ambrosini, D., Lingala, B., Gandek, B., Rose, M., et al. (2009). Better assessment 
of physical function: Item improvement is neglected but essential. Arthritis Research and Therapy, 
11, R191. 

Ebner-Priemer, U. W., & Kubiak, T. (2007). Psychological and psychophysiological ambulatory moni- 
toring: A review of hardware and software solutions. European Journal of Psychological Assess- 
ment, 23, 214-226. 

Ebner-Priemer, U. W., & Sawitzki, G. (2007). Ambulatory assessment of affective instability in border- 
line personality disorder: The effect of the sampling frequency. European Journal of Psychological 
Assessment, 23, 238-247. 

Fahrenberg, J., Bruegner, G., Foerster, F., & Kaeppler, C. (1999). Ambulatory assessment of diurnal 
changes with a hand-held computer: Mood, attention and morningness-eveningness. Personality 
and Individual Differences, 26, 641-656. 

Fahrenberg, J., Hüttner, P., & Leonhardt, R. (2001). MONITOR: Acquisition of psychological data 
by a hand-held PC. In J. Fahrenberg & M. Myrtek (Eds.), Progress in ambulatory assessment 
(pp. 93-112). Seattle, WA: Hogrefe & Huber. 

Feldman Barrett, L., & Barrett, D. J. (2001). An introduction to computerized experience sampling in 
psychology. Social Science Computer Review, 19, 175-185. 

Feldman Barrett, L., & Barrett, D. J. (2005). ESP, the experience sampling program. Retrieved January 
1, 2010, from www.experience-sampling.org. 

FlowingData. (2010). yourFlowingdata project. Retrieved January 1, 2010, from
your.flowingdata.com.

Free Software Foundation. (2007). GNU General Public License. Retrieved January 1, 2010, from 
www.gnu.org/licenses/gpl-3.0.html. 

Froehlich, J. (2007). MyExperience documentation. Retrieved January 1, 2010, from
myexperience.sourceforge.net.

Froehlich, J., Chen, M., Consolvo, S., Harrison, B., & Landay, J. (2007, June). MyExperience: A system 
for in situ tracing and capturing of user feedback on mobile phones. Presented at MobiSys 2007, 
San Juan, Puerto Rico. Available at myexperience.sourceforge.net. 

Gartner Incorporated. (2010). Press release, May 19, 2010. Retrieved July 1, 2010, from
www.gartner.com/it/page.jsp?id=1372013.

Intille, S. S. (2007). Technological innovations enabling automatic, context-sensitive ecological momen- 
tary assessment. In A. A. Stone, S. Shiffman, A. A. Atienza, & L. Nebeling (Eds.), The science of 
real-time data capture (pp. 308-337). Oxford, UK: Oxford University Press. 

Intille, S. S., Rondoni, J., Kukla, C., Anacona, I., & Bao, L. (2003). A context-aware experience sampling
tool. In Proceedings of CHI ’03 extended abstracts on human factors in computing systems
(pp. 972-973). New York: ACM. 

Kubiak, T., & Jonas, C. (2007). Applying circular statistics to the analysis of monitoring data: Patterns 
of social interactions and mood. European Journal of Psychological Assessment, 23, 227-237. 

Kubiak, T., Jonas, C., & Weber, H. (2011). Ruminative thinking in daily life. Manuscript submitted 
for publication. 

Reicherts, M., Salamin, V., Maggiori, C., & Pauls, K. (2007). The Learning Affect Monitor (LAM): A 
computer-based system integrating dimensional and discrete assessment of affective states in daily 
life. European Journal of Psychological Assessment, 23, 268-281. 

Siewert, K., Antoniw, A., Kubiak, T., & Weber, H. (2011). The more the better?: The relationship 
between mismatches in social support and subjective well-being in daily life. Journal of Health 
Psychology, 16, 621-631. 




Trougakos, J. P., Beal, D. J., Green, S. G., & Weiss, H. M. (2008). Making the break count: An episodic 
examination of recovery activities, emotional experiences, and positive affective displays. Acad- 
emy of Management Journal, 51, 131-146. 

Tugade, M. M., Fredrickson, B. L., & Feldman Barrett, L. (2004). Psychological resilience and positive 
emotional granularity: Examining the benefits of positive emotions on coping and health. Journal 
of Personality, 72, 1161-1190. 

Weiss, H. M., Beal, D. J., Lucy, S. L., & MacDermid, S. M. (2004). Constructing EMA studies with 
PMAT: The Purdue Momentary Assessment Tool user’s manual. Retrieved January 1, 2010, from 
www.ruf.rice.edu/~dbeal/pmat.html. 

Wilhelm, P., & Schoebi, D. (2007). Assessing mood in daily life: Structural validity, sensitivity to 
change, and reliability of a short-scale to measure three basic dimensions of mood. European 
Journal of Psychological Assessment, 23, 258-267. 


CHAPTER 8 


Daily Diary Methods 


KATHLEEN C. GUNTHERT 
SUSAN J. WENZE 


A few decades ago, the literature on stress and adjustment was dominated by work that
focused on the effect of major life events on health and psychological functioning.
This made some intuitive sense because major life events are often tremendously disrup- 
tive and anxiety provoking, and they seem to require our best coping efforts. Smaller 
stressors, such as a minor conflict with a partner or a child refusing to eat dinner, cer- 
tainly seem less significant, and were mostly ignored in the stress literature. 

However, in the early 1980s, researchers started to challenge these assumptions about 
the importance of the ins and outs of everyday life. In one of the first papers comparing the 
effects of major life events and minor daily hassles, DeLongis, Coyne, Dakof, Folkman, 
and Lazarus (1982) showed that everyday hassles are more predictive of health symp- 
toms than are major life events. Extending this work, Pillow, Zautra, and Sandler (1996) 
showed that major life events tend to exert their influence through everyday stressors. 
These studies represented a shift in how we conceptualize stress, vulnerability, and out- 
comes. Major life stress, while important, is relatively infrequent. The cumulative effect 
of negative reactions to minor challenges, day in and day out, can make us vulnerable to 
health problems and psychological maladjustment (e.g., Massey, Garnefski, Gebhardt, & 
van der Leeden, 2009; O’Neill, Cohen, Tolpin, & Gunthert, 2004). This emerging emphasis
on everyday functioning has necessitated a reliance on a different approach to research,
one that involves tracking people as they negotiate their day-to-day lives. The daily diary 
methodology became an important tool in research on stress, emotion, and health. 

There is no consistent terminology for daily diary designs, also commonly
called daily process designs. Many people use the term daily diary study to describe a 
once-per-day assessment approach. It is not uncommon, though, for people to use the 
term diary study to refer to a design in which individuals are prompted multiple times per 
day to respond (e.g., Mor et al., 2010). For clarity in this chapter, we define a daily diary 
design as one in which participants are assessed once per day over a defined period (rang- 
ing from a few days to months). For our purposes, studies that entail multiple assessments 
at different times of the day are called experience sampling studies or ecological momen- 
tary assessment studies. 
[Figure 8.1: x-axis shows 5-year publication periods: 1985-1989, 1990-1994, 1995-1999, 2000-2004, 2005-2009.]


FIGURE 8.1. Increase in research studies using daily diary methods over time. Note: Number of 
studies estimated from a PsycINFO search using the keywords daily diary, daily diaries, or daily 
process, and limiting the search to research with human participants published in journals. This 
search likely missed studies that used other terminology (e.g., assessed daily or daily log), but the 
general trend is clear. 


In the middle to late 1990s, the daily diary method quickly gained in popularity in 
not only the stress literature but also the social/personality/clinical/health literatures in 
general. In fact, Figure 8.1 illustrates the drastic change in the use of this methodology 
over the past 15 years. 

The increased use of daily diary methods probably occurred for a number of rea- 
sons. Clearly we had research questions that were best answered with intensive daily 
monitoring designs. In addition, researchers recognized the advantages of obtaining a rich
dataset with many observations per person, allowing us to study simultaneously both 
within- and between-person sources of variability. Fortunately, advances in statistical 
techniques also helped researchers to capitalize on the advantages of intensive repeated 
measures data, along with user-friendly explanations of those techniques (e.g., Nezlek, 
2001). Finally, these designs became relatively easy to administer with the increased ease 
of Web-based data collection. 

In this chapter, we review the advantages and disadvantages of daily diary designs, 
the types of research questions that are particularly well suited for a daily diary method, 
and design considerations for researchers who are planning a daily diary study. Finally, 
we offer some suggestions for future research using the daily diary method. 


Advantages of Daily Diary Methods 
over Less Intensive Monitoring 


Many chapters in this handbook outline the general advantages of frequent monitoring
of behaviors, cognitions, moods, and symptoms (for an overview of the advantages of 
repeated intensive monitoring, see Reis, Chapter 1, this volume). Some of the frequently 
cited advantages are accessing individuals outside the laboratory in their everyday envi- 
ronments (increased ecological validity), minimization of recall bias, and the ability to 
capture within-person processes and within-person variability (e.g., Bolger, Davis, & 
Rafaeli, 2003; Moskowitz, Russell, Sadikaj, & Sutton, 2009). Given the in-depth cover-
age of this material in other chapters, we provide only brief comment on some of these 
advantages. 

As discussed by Schwarz (Chapter 2, this volume), cognitive processing and memory 
errors likely decrease when participants are asked to reflect on short periods of time 
that are very recent. Retrospective questionnaires that ask participants to aggregate their 
experiences over time are influenced by a number of systematic recall biases. Participant 
responses tend to be overly influenced by current mood state, the most salient and most 
recent experiences, and beliefs about the self and how one typically behaves (Bolger et 
al., 2003; Moskowitz et al., 2009). Importantly, we have to keep in mind that recall bias 
might be systematically related to our variables of interest. For example, in our work, we 
have found that depression is associated with recall bias in systematic ways: Depressed 
participants tend to overrecall negative mood, whereas nondepressed individuals tend to 
overrecall positive mood (Wenze, Gunthert, & German, 2009). If, then, we ask for ret- 
rospective mood assessments, nondepressed individuals will look even better than actual 
experience, and depressed individuals will look even worse than actual experience, and 
this bias might artificially inflate relationships of interest. Therefore, daily diary studies 
are particularly useful in the case where recall bias is systematically related to variables 
of interest. 

Researchers are also drawn to daily diary designs for their ability to capture within- 
person processes as they unfold over time. Hamaker (Chapter 3, this volume) provides a 
good discussion of the benefit afforded by within-person (also called idiographic) designs. 
Using a within-person analysis, we are able to see how people react relative to their own 
baseline, when specific situations arise (Bolger et al., 2003). At the between-person level, 
we can then predict variation in these within-person “reactivity” indices as a function of 
individual-difference variables. Within-person analyses of repeated measures data also 
allow inferences about the temporal sequence of processes, using time-lagged analyses 
(e.g., Stader & Hokanson, 1998). So we can see whether changes in variable X are fol- 
lowed in time by changes in variable Y. This allows us to model not only initial reactions 
to specific situations, as reported concurrently, but also the extent to which these reac- 
tions linger into the subsequent day(s) (Gunthert, Cohen, Butler, & Beck, 2007). 
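
The within-person logic described above can be sketched in a few lines of Python. The example centers each participant's daily scores on that participant's own mean and then pairs each day's predictor with the next day's outcome. The data layout, the variable names (`stress`, `mood`), and all values are hypothetical illustrations, not data from the studies cited. A real analysis would fit a multilevel model rather than inspect raw pairs.

```python
from statistics import mean

def person_center(scores):
    """Return each daily score as a deviation from the person's own mean."""
    baseline = mean(scores)
    return [s - baseline for s in scores]

def lagged_pairs(x, y, lag=1):
    """Pair x on day t with y on day t + lag (assumes consecutive days, lag >= 1)."""
    return list(zip(x[:-lag], y[lag:]))

# Hypothetical diary data: one list of daily ratings per participant.
diary = {
    "p01": {"stress": [2, 4, 3, 5, 1], "mood": [3, 2, 4, 1, 2]},
    "p02": {"stress": [1, 1, 2, 1, 5], "mood": [5, 5, 4, 5, 5]},
}

for pid, days in diary.items():
    stress_c = person_center(days["stress"])
    mood_c = person_center(days["mood"])
    # Each pair is (today's stress deviation, tomorrow's mood deviation).
    print(pid, lagged_pairs(stress_c, mood_c))
```

Centering on the person's own mean is what makes the comparison "relative to their own baseline"; the lagged pairing is the simplest form of the time-lagged analysis mentioned in the text.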

It is noteworthy that within-person variability itself is often an important individual- 
difference variable captured in daily diary designs (Bolger et al., 2003). For example, 
greater variability in daily self-esteem has been associated with posttraumatic stress dis- 
order (Kashdan, Uswatte, Steger, & Julian, 2006). This within-person variability, which 
in this case captures important information about self-esteem instability, is evident only 
through repeated assessments over time. 
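
As a minimal illustration of why such variability is visible only with repeated assessments, the following Python sketch compares two hypothetical participants whose average daily self-esteem is identical but whose day-to-day fluctuation differs. The ratings and the 1-7 response scale are assumptions made for the example; the within-person standard deviation is only one of several possible instability indices.

```python
from statistics import mean, stdev

def within_person_sd(scores):
    """Sample standard deviation of one person's daily scores."""
    return stdev(scores)

# Hypothetical daily self-esteem ratings (1-7 scale) over one week.
self_esteem = {
    "stable":   [4, 5, 4, 3, 4, 5, 3],
    "unstable": [2, 7, 3, 6, 1, 7, 2],
}

for pid, scores in self_esteem.items():
    # Same mean for both people; only the spread distinguishes them.
    print(pid, round(mean(scores), 2), round(within_person_sd(scores), 2))
```

A single retrospective rating would be expected to capture something close to the mean and would therefore treat these two participants as alike.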


Advantages of Daily Diary Methods over Ecological 
Momentary Assessment: Sometimes Less Is More 


Of course, all of the previously mentioned advantages also apply to more intensive 
designs, such as ecological momentary assessment (EMA) studies, in which participants 
are prompted multiple times per day to report on their states and behaviors. But there 
are also costs associated with more intensive measurement. Sometimes a daily assess- 
ment approach hits that right balance between too little and too much assessment. First, 
compliance rates might be better when participants only need to respond once per day
compared to multiple times per day. Second, as we discuss later, some variables are simply 
better suited to a once-daily assessment approach. Third, daily diary studies sometimes 
offer greater ease and affordability in implementation. With Internet access so prevalent, 
it is quite easy to send a daily link to participants to complete Web-based surveys. For 
most of the EMA studies we conducted, the process of accessing participants in real time 
was a bit more complicated (e.g., programming personal digital assistant [PDA] devices 
to prompt participants, explaining PDA devices to participants, worrying about malfunc- 
tions or PDA loss/damage). PDA devices and data collection software are also typically 
more expensive than Web-based services. Fourth, daily diary studies might be more toler- 
able over longer periods of time compared to assessments that occur multiple times per 
day. For example, Laurenceau, Feldman Barrett, and Rovine (2005) used the daily diary 
method to assess marital communication and intimacy over 42 consecutive days. This
would be a particularly long period over which to ask for more than one assessment per day.


Disadvantages of Daily Diary Methods 


Of course, there are some downsides to repeated intensive assessment. There has been 
a worry with daily diary studies that the higher burden involved in intensive repeated 
assessment might discourage participation, or result in a sampling bias related to motiva- 
tion and personality factors (Scollon, Kim-Prieto, & Diener, 2003). Generally, though, 
compliance rates with daily diary studies tend to be quite high, even with clinical popula- 
tions (e.g., Cohen et al., 2008; Cranford, Tennen, & Zucker, 2010). Another threat to 
validity of these studies is measurement reactivity, which occurs when the act of repeated 
assessment systematically changes participants’ behaviors or responses (see Barta, Ten- 
nen, & Litt, Chapter 6, this volume). For example, participants might change their daily 
eating behavior if they know that they have to report on it, they might be very careful 
with responses in the first few assessments and get more careless as boredom with the 
study sets in, or they might develop more insight through self-reflection as the study 
progresses (Bolger et al., 2003). Finally, the complexity of daily diary datasets often 
requires advanced statistical procedures. There are both within-person and between- 
person sources of error that must be modeled simultaneously, requiring multilevel model- 
ing approaches (see Part III on data-analytic methods). It is not inherently problematic 
to use more advanced statistical procedures, but some researchers might initially find the 
analysis complexities a bit daunting. 

There are also drawbacks of diary designs relative to the more intensive EMA mea- 
surement approaches. Even within the course of a single day, memory biases may still influ- 
ence our recall of mood, pain, behaviors, and other momentary variables. For example, 
when participants are asked to recall their mood over the course of the day, the ratings do 
correlate highly with the average of mood states across the day; however, the mood rat- 
ings tend to be biased in the direction of the “peak” or more intense moods experienced 
(Neale, Hooley, Jandorf, & Stone, 1987). However, the recall bias that occurs in one day 
is probably small relative to the degree of recall bias in more traditional retrospective 
designs that ask participants to aggregate experiences over weeks, months, or more. In 
addition, some processes, such as cognition and mood, can fluctuate greatly within a day, 
and that within-day fluctuation (and associated correlates of the fluctuation) is lost in the 
once-daily assessments. Finally, when we assess people once per day, they tend to respond
to our surveys in more limited contexts, such as their home computers. EMA studies tend 
to capture individuals’ responses as they are actively engaged in navigating their daily 
environments (at work, at school, while out socializing, etc.). 


Constructs That Are Well Suited for a Daily Diary Method 


The daily diary methodology has been widely embraced because so many of the behav- 
iors, situations, cognitions, moods, and self-concepts in which we are interested fluctuate 
from day to day. Some of the more common constructs that appear in daily diary studies 
are stress (e.g., Cohen et al., 2008), affect (e.g., Connelly et al., 2007), self-esteem (e.g., 
Heppner et al., 2008), coping (e.g., Gunthert, Cohen, & Armeli, 1999), social support 
(e.g., Gleason, Iida, Shrout, & Bolger, 2008), interpersonal functioning and conflict (e.g.,
Eberhart & Hammen, 2009), marital functioning and satisfaction (e.g., Murray, Bel- 
lavia, Rose, & Griffin, 2003), physical symptoms (e.g., Connelly et al., 2007), eating 
behaviors (e.g., O'Connor, Jones, Conner, McMillan, & Ferguson, 2008) and alcohol or 
substance use (e.g., DeHart, Tennen, Armeli, Todd, & Mohr, 2009). 

Certain features of variables make them particularly well suited to a daily diary 
design. When constructs show substantial within-person variability,
or when recall will interfere with the accuracy of reporting over long periods of time, 
daily diary designs are often preferable to less intensive monitoring approaches, such as 
single-administration assessments. But there are also times when constructs are better 
suited to a daily diary method than a more intensive monitoring design, such as EMA. 
So what types of constructs are optimally assessed in a once-per-day assessment for- 
mat? 


Behaviors or Situations That Are Moderately Frequent 


Of course, moderately frequent is a relative term, so we should clarify that we are refer- 
ring to behaviors that do not happen many times over the course of a day but would be 
expected to occur at least a few times over the course of the assessment period. In these 
cases, a daily diary design will likely be the optimum design choice compared to tradi- 
tional designs with longer periods of recall (the diary design offers more accuracy), and 
compared to EMA designs (which might increase participant burden for little additional 
benefit). For example, Chung, Flook, and Fuligni (2009) studied family conflict among 
adolescents of different ethnic backgrounds. Across all groups, family conflict occurred 
fairly infrequently in the 2-week diary phase, but when conflict did occur, there was a 
significant impact on emotional distress. If family conflict is occurring only every few 
days, it might be unnecessary for adolescents to complete multiple assessments through- 
out each day. The processes of interest might be efficiently captured in a once-per-day 
reporting format. 

Another example of a moderately frequent behavior is sexual behavior. In a study of 
college students, O’Sullivan, Udell, and Patel (2006) found that sexual behavior occurred 
on approximately 1 out of every 3 days reported. Given the moderately frequent occur- 
rence of sex behaviors, daily diary studies have been very helpful at documenting the 
everyday perceptual, behavioral, and emotional predictors of unprotected sex behaviors
(Kiene, Barta, Tennen, & Armeli, 2009; Mustanski, 2007; O’Sullivan et al., 2006).
Importantly, research has shown that longer recall of high-risk sexual behavior is biased 
by beliefs about situational determinants of the behavior (Leigh, Morrison, Hoppe, Bead- 
nell, & Gillmore, 2008). Thus, the daily reporting of sex behaviors and relevant ante- 
cedents is critical. Other moderately frequent behaviors and symptoms in some popula- 
tions include headaches and pain (Edwards, Almeida, Klick, Haythornthwaite, & Smith, 
2008; Massey et al., 2009) and interpersonal conflict (Suls, Martin, & David, 1998). In 
summary, some behaviors occur frequently enough that they are difficult to aggregate in 
memory over long periods of time, but infrequently enough that they do not need to be 
assessed many times per day (Moskowitz et al., 2009). 


Behaviors or Situations That Occur at Certain Times of the Day 


Behaviors that occur predictably at certain times of the day are also probably well suited 
to a daily diary format. For example, sleep does not usually need to be assessed multiple 
times across the day, nor does alcohol use, in normal populations. Certainly, daytime 
napping occurs, but it is likely to be easily remembered on the once-daily reports of 
sleep. 


Behaviors or Situations That Are More Resistant 
to Within-Day Recall Bias 


Some of our everyday behaviors and situations are probably more susceptible to recall 
bias, even over short periods of time. Fleeting subjective experiences are likely more vul- 
nerable to recall bias than are discrete occurrences/behaviors (Tennen, Affleck, Coyne, 
Larsen, & DeLongis, 2006). For example, in our work, we have assessed everyday nega- 
tive and positive cognitions (Wenze, Gunthert, & Forand, 2007), in which participants 
endorse the degree to which they have had certain thoughts since the last assessment. 
The task of recalling and aggregating thought processes over the course of a day seems 
difficult, so we chose to use an EMA paradigm (participants assessed four times per day) 
for this construct. In contrast, constructs that require reporting whether a concrete situ- 
ation occurred are likely less susceptible to recall bias. In the assessment of stress, for 
instance, when participants are asked whether specific events happened, most individuals 
will probably be able to recall accurately whether an event occurred. It is more difficult 
for them to recall and quickly aggregate the complex patterns of emotions and coping 
that occurred over the course of dealing with the stressor. 


Medical Monitoring and Compliance Behaviors 


The field of health psychology has capitalized on daily diary methods because they allow 
tracking of symptoms and compliance behaviors. This is particularly important, because 
research has demonstrated recall biases in the reporting of symptoms over longer peri- 
ods of time, and treatment decisions might be based on such information (McKenzie & 
Cutrer, 2009). Daily diaries can also help to document treatment efficacy (e.g., Whalen 
et al., 2010), and to identify the predictors of daily variation in symptoms (Skaff et al., 
2009). One area that has seen a lot of this type of research is chronic pain disorders, 
such as rheumatoid arthritis and fibromyalgia. Researchers have made good progress in 
understanding the antecedents and consequences of pain episodes through the use of daily
symptom tracking (e.g., Connelly et al., 2007). 


Design Considerations 


When designing a daily diary study, there are a number of decision points regarding the 
structure of the diary and the measurement of constructs. In this section, we try to pro- 
vide concrete advice on constructing the daily diary assessment. 


Time Period of Assessment 


Most daily diary studies range from about 7 to 30 days, with some lasting as long as a 
year or more (Fals-Stewart, 2003). The modal assessment period is probably about 2 
weeks. Of course, the optimal time period of assessment is dependent on the constructs of 
interest. The more often the constructs of interest occur in the population, the greater the 
likelihood of catching multiple instances in a shorter time period. Minor levels of stress 
tend to occur on most days, so one can often get acceptable variability in stress levels over 
the course of 2 weeks (e.g., Gunthert et al., 1999). Researchers who study less frequent 
behaviors/situations sometimes choose a longer assessment period. For example, Fals- 
Stewart (2003) assessed partner physical aggression (among those with domestic violence 
history) for a period of 15 months. 

A less frequently used strategy is to sample days over longer periods of time. In their 
study of family predictors of nighttime waking among children with asthma, Fiese, Win- 
ter, Sliwinski, and Anbar (2007) assessed families four times a week, every 3 months, 
for a period of 1 year. This strategy is probably underused in daily diary studies; it is 
certainly one that could increase generalizability. If we sample days across longer periods, 
we are likely to assess people under more varied circumstances and symptom states. 
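
A sampled-days schedule of this kind is straightforward to generate. The Python sketch below is loosely modeled on the design described above (four diary days within a week, repeated every 3 months over a year). The function name, the random choice of days within each week, the 30-day approximation of a month, and the start date are all assumptions made for illustration; they are not details reported by Fiese and colleagues.

```python
from datetime import date, timedelta
import random

def burst_schedule(start, n_bursts=4, months_between=3, days_per_burst=4, seed=0):
    """Pick days_per_burst random days within a 7-day window at each burst."""
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    schedule = []
    for b in range(n_bursts):
        # Approximate a month as 30 days to keep the sketch dependency-free.
        window_start = start + timedelta(days=30 * months_between * b)
        offsets = sorted(rng.sample(range(7), days_per_burst))
        schedule.append([window_start + timedelta(days=o) for o in offsets])
    return schedule

# Hypothetical study start date.
for burst in burst_schedule(date(2024, 1, 8)):
    print([d.isoformat() for d in burst])
```

Spreading the same number of diary days across several bursts keeps participant burden roughly constant while covering more of the year.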


Optimal Length of Assessments 


We are not aware of any studies that test the optimal length for daily assessment mea- 
sures, and surely this might vary based on populations. At what point will the length 
of the questionnaire contribute to attrition, missing data, or increased random (bored) 
responding? As of now, we know of little data addressing these issues. As a general rule, 
however, we try to keep our assessments to under 10 minutes. 


Measurement Issues 


Although the daily diary methodology has gained widespread popularity, researchers are 
still routinely grappling with lack of appropriate measures for quick, everyday assess- 
ment. We have been slow to develop measures that have been put through the typical 
tests of reliability and validity that are expected in traditional self-report designs. Some 
assessment instruments, however, were constructed for daily diary use (with attention to 
issues of reliability and validity). Some examples are the Rochester Interaction Record 
(Wheeler & Nezlek, 1977) for assessment of everyday social exchanges, Stone and Neale’s 
(1984) daily coping assessment, the Social Behavior Inventory (Moskowitz, 1994), the 
newer Daily Assessment of Symptoms-Anxiety (DAS-A; Morlock et al., 2008), a three-
dimension mood scale (Wilhelm & Schoebi, 2007), and a computerized mobile phone 
daily mood assessment (Courvoisier, Eid, Lischetzke, & Schreiber, 2010). 

Generally, researchers attempt to abbreviate longer measures designed for retrospec- 
tive or trait-level assessments. Researchers typically want to use only a few items to cap- 
ture each construct, so they must decide how to find a short and representative group of 
items from larger scales. One approach is to use the items with the highest factor load- 
ings on a scale, if such information is available (e.g., Gunthert et al., 2007). But it will 
be important to ensure that this approach does not generate a small scale that overem- 
phasizes some aspects of a construct, and underemphasizes others. Additionally, when 
reducing trait-level scales to daily scales, it is important to pay attention to the internal 
reliability of the original scale. When scales have high internal reliability, there is reason 
to believe that one can scale back a bit on the items. However, if internal reliability is 
modest, one can have less confidence that two or three items will assess the construct 
of interest. Of course, reliability of item responses on the trait level does not necessarily 
reflect reliability of the items as assessed on a daily level. 
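
To illustrate why the internal reliability of the original scale matters when shortening it, the sketch below applies the Spearman-Brown prophecy formula, which projects the reliability of a scale whose length is multiplied by a given factor, assuming the items are parallel. This is a standard psychometric formula rather than a procedure described in this chapter, and the scale lengths and reliabilities are hypothetical. Because trait-level reliability need not carry over to the daily level, the projection is at best a rough guide.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when scale length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical 20-item trait scales cut down to 3 daily items.
for original in (0.95, 0.75):
    projected = spearman_brown(original, 3 / 20)
    print(f"original reliability {original:.2f} -> 3 items: {projected:.2f}")
```

Under these assumptions, a highly reliable original scale retains far more of its reliability after shortening than a modestly reliable one, which is the intuition behind the advice above.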

Fortunately, with the growing number of daily diary studies, more and more con- 
structs have been studied on the daily level. Researchers might do well to look at the 
scales already generated, including reliability and predictive validity, and use these daily 
measures as a starting point. However, because some of these scales do not have much 
information on scale development or psychometrics, researchers need to think critically 
about the scales and how well they assess the constructs of interest, and work toward 
establishing reliability and validity of the measures (see Shrout and Lane, Chapter 17, 
this volume). 


Importance of Reminders 


We have found that daily reminders increase the likelihood of compliance. For Web-based 
studies, researchers might send an e-mail with the link each day, so that participants can 
easily access the site without searching through older e-mails. 


Paper-and-Pencil versus Electronic Data Collection 


Conner and Lehman (Chapter 5, this volume) outline some of the advantages and dis- 
advantages of paper-and-pencil versus electronic formats. We briefly comment on these 
issues as well. For a more thorough discussion of the advantages and disadvantages of 
these formats, see the 2006 series of articles on the issue in Psychological Methods (Green, 
Rafaeli, Bolger, Shrout, & Reis, 2006; and corresponding commentaries).

Paper-and-pencil formats are inexpensive and can be easy to administer. They
require little participant training. However, some researchers have expressed concerns 
that participants fail to complete measures as directed, with participants completing them 
significantly after the correct time frame or in batches. One of the more creative stud- 
ies of compliance rates in paper-and-pencil studies was by Stone, Shiffman, Schwartz, 
Broderick, and Hufford (2003). The authors gave participants paper questionnaires to 
complete at specific times (three times per day for 21 days), but unbeknownst to partici- 
pants, the binder that held the surveys had a photosensor that electronically recorded 
the times that the binder was opened. Results indicated a high rate of faked compliance.
Participants reported approximately 90% adherence with the daily diary entries. However,
time-stamped data indicated that actual compliance was closer to 11%. There were
some participants who even completed diary entries ahead of time. In the same study, 
the compliance rate for an electronic diary was 94%. This comparison might not be a 
completely fair one, though, given that the participants in the electronic diary condition 
were prompted by alarms at the beginning and end of the diary window at each assess- 
ment and told that compliance was being monitored. The paper-and-pencil participants 
might have had somewhat higher compliance had they been given the same prompts and 
monitoring information (Green et al., 2006). In contrast to the Stone and colleagues
(2003) study, Green and colleagues (2006) conducted a study of paper-and-pencil diaries
and found that reported completion times generally matched randomized signaling
times (note that diaries were completed multiple times per day), and that data obtained 
through paper-and-pencil questionnaires were largely psychometrically equivalent to 
data obtained through electronic assessment. 
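
To make the idea of time-stamp verification concrete, the following Python sketch computes the share of scheduled assessments that have an electronically recorded time stamp within a tolerance window. The 30-minute window, the schedule, and the time stamps are all hypothetical assumptions chosen for illustration; they are not the criteria used in the studies cited above.

```python
from datetime import datetime, timedelta


def verified_compliance(scheduled, opened, window_minutes=30):
    """Share of scheduled assessments with a recorded time stamp inside the window."""
    window = timedelta(minutes=window_minutes)
    hits = sum(
        any(abs(stamp - due) <= window for stamp in opened)
        for due in scheduled
    )
    return hits / len(scheduled)


# Hypothetical schedule: three assessments on one day (10:00, 16:00, 20:00).
scheduled = [datetime(2003, 1, 6, hour) for hour in (10, 16, 20)]
# Hypothetical time stamps recorded by the device.
opened = [datetime(2003, 1, 6, 10, 12), datetime(2003, 1, 6, 22, 45)]

rate = verified_compliance(scheduled, opened)  # only the 10:00 entry is verified
```

Comparing this verified rate with the rate implied by participants' own reported completion times is one way to see the gap between reported and actual compliance.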

Some researchers argue that paper-and-pencil diary compliance might be enhanced 
by augmenting the design with reminder alarms (e.g., watch alarms, phone calls), foster- 
ing a sense of the importance of participants’ involvement in the study, and maintaining 
regular contact (Bolger et al., 2003; Green et al., 2006). Researchers can also bolster con- 
fidence in the accuracy of completed diary surveys by taking simple steps, such as having 
participants drop off surveys daily at specified locations (e.g., on a college campus; Stader 
& Hokanson, 1998), or providing stamped envelopes in which surveys are mailed and
postmarked on a daily basis (Tennen et al., 2006). Alternatively, Chung and colleagues 
(2009) asked participants to place the surveys in an envelope, then had them stamp the 
seal with a handheld electronic time stamp. With these augmented efforts to increase and 
verify compliance, researchers are likely able to obtain good data with paper-and-pencil 
diaries. 

However, there are some other advantages to using electronic approaches to capture 
daily diary data. As Tennen and colleagues (2006) pointed out, the flexibility of computerized
tasks is appealing; electronic diaries can easily randomize stimuli
each day (e.g., in experimental studies) and administer tasks that stray from traditional
self-report, such as memory tasks and reaction time tasks.

In the daily methodology, the electronic assessments are most often administered 
in either Web-based or interactive voice response (IVR) protocols. Both of these options 
provide electronic date and time stamps, so that compliance is recorded more accurately. 
Web-based survey software, such as SurveyMonkey, has become very user friendly and 
fairly inexpensive. In many populations, participants tend to be on computers frequently, 
so it is not difficult for them to access the surveys. Hence, some of the burden
to the researcher and to the participant is reduced in daily, Web-based data collection. 
IVR protocols are increasingly common as well, likely due to the availability of a wider 
range of software options (e.g., SmartQ from TeleSage) and the increased reliance on 
cell phones in our culture overall. IVR protocols can be useful when researchers want 
to contact participants each day to facilitate compliance rather than to rely on partici- 
pants to initiate the survey on the Web or on paper. For example, Cohen and colleagues 
(2008) conducted an IVR-based study of depressed outpatients, a population that might 
be lower on motivation in general, so reaching out through a phone call each day might 
have helped with compliance. Also, with the current pervasiveness of cell phones, IVR-based
surveys might be less limiting in terms of the need to physically be at a computer to
complete daily surveys (see Courvoisier et al., 2010, for an IVR example using multiple
assessments per day). 

Of course, there are some disadvantages to electronic collection approaches. Depend- 
ing on the specifications of the study, they can be more costly than paper-and-pencil 
approaches. However, the Web-based survey programs are currently very affordable, and 
we wonder whether the cost of stamps and photocopying for a paper diary study might 
now be almost comparable to or more expensive than Web-based data collection. Still, 
computers are not accessible to all populations, which is an important limitation. IVR- 
based approaches might require more patience in listening to questions and pressing num- 
bers rather than reading at one’s own pace either online or on paper surveys. IVR-based 
surveys tend to take more time to administer, which means that researchers can typically 
fit fewer questions on an IVR survey than on Web-based or paper surveys. Overall, then, 
there are advantages and disadvantages to each of these approaches that must be consid- 
ered in light of the research question, the laboratory resources, the study budget, and the 
participant characteristics. 


Special Considerations for Future Research 


Fortunately, researchers have embraced the advantages afforded by the daily diary meth- 
odology and other real-time data capture techniques. Throughout this handbook, there 
are examples of how the methodology is evolving through creative advancements in tech- 
nology, statistics, and conceptualization of some of our common constructs. In the next 
section, we identify a number of methodological and conceptual directions that might 
also move us forward in our use of daily diary data. First, as a field, we need to study psy- 
chometric features of the type of data we are collecting. Second, there are ways we might 
be more creative with how we think about and use the data we are collecting. 


Stability of Scales over Time 


Although there are some procedures for estimating internal consistency in daily diary 
studies (e.g., see Nezlek & Gable, 2001; see also Nezlek, Chapter 20, this volume), there 
has been little attention to the stability of scales over the course of repeated daily observa- 
tions. It is possible that participants start responding with a particular response set over 
time, or respond more randomly over time, such that internal reliability would change 
over the course of the study. It is certainly problematic if we do a better job of measuring 
constructs at specific times in the study. Cranford and colleagues (2006) have laid out 
procedures for decomposing variance in a way that more accurately assesses reliability, 
along with testing how reliably scales measure change across the course of the study (see 
also Shrout & Lane, Chapter 17, this volume). Generally, however, we need to pay more 
attention to the ways that scale use might change with frequent administration. 


Stability of Within-Person Processes 


One of the noted strengths of daily diary designs is the ability to capture within-person 
processes. For example, researchers have demonstrated that emotional reactivity, or a 
stronger within-person association between daily stress and negative affect, is associated
with neuroticism (Bolger & Zuckerman, 1995; Gunthert et al., 1999) and age (Mroczek
& Almeida, 2004). When we research within-person processes, we tend to assume that 
we are capturing a “typical” within-person slope for each individual. That assumption 
holds only if within-person processes are at least somewhat stable over time. However, 
Sliwinski, Almeida, Smyth, and Stawski (2009) argued that these reactivity indices might 
vary not only between but also within persons over time. They noted that “studies typi- 
cally do not examine this possibility, instead treating reactivity as a fixed (invariant) 
characteristic of the person” (p. 829). The Sliwinski and colleagues study is one of the 
few longitudinal studies to address this issue. They used two measurement-burst designs 
(repeated sequences of daily assessment) to assess stability in within-person relationships 
between stress and negative affect. In one middle-aged sample of adults, the correlation 
between reactivity indices across a 10-year time span was .37. Stability was greatest 
for individuals in their 30s (r = .62) and decreased throughout the lifespan. The second 
study of older adults also suggested not only some stability over time but also substantial 
within-person variability in reactivity processes. Given the many daily diary studies on the
within-person associations between stress and distress (or other relevant outcomes), we 
need more research testing how consistent these processes are over time, along with the 
characteristics of people who display more or less consistency. 
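
As a minimal sketch of what a within-person reactivity index looks like, the following Python code estimates a separate ordinary least squares slope of negative affect on stress for each person. The participant identifiers and diary values are hypothetical, and fitting one regression per person is a simplifying assumption; the studies cited above typically estimate such slopes within multilevel models.

```python
def within_person_slope(stress, affect):
    """OLS slope of affect on stress for one person's daily records."""
    n = len(stress)
    mean_s = sum(stress) / n
    mean_a = sum(affect) / n
    # Centering on the person's own means isolates day-to-day covariation.
    sxy = sum((s - mean_s) * (a - mean_a) for s, a in zip(stress, affect))
    sxx = sum((s - mean_s) ** 2 for s in stress)
    return sxy / sxx


# Hypothetical diaries: person id -> (daily stress, daily negative affect).
diaries = {
    "p01": ([0, 1, 2, 3, 4], [1.0, 1.5, 2.0, 2.5, 3.0]),
    "p02": ([0, 1, 2, 3, 4], [2.0, 2.1, 2.2, 2.3, 2.4]),
}
reactivity = {pid: within_person_slope(s, a) for pid, (s, a) in diaries.items()}
```

Computing the same index for each person in a later measurement burst and correlating the two sets of slopes across persons would give a stability coefficient of the kind reported above.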


Predictors of Missing Data 


Missing data are an expected nuisance for daily diary researchers, since most participants 
do not complete every assessment. The problem is that the occurrence of missing data 
might be systematically affected by both situational and individual-difference variables. 
Are participants more likely to skip an assessment on days when they have more pain, 
more negative mood, less sleep, or more stress? At the individual-difference level, a num- 
ber of factors could influence completion of daily diaries, including motivation, consci- 
entiousness, and level of functioning in clinical samples (Scollon et al., 2003). Jiang and 
colleagues (2009) found that among asthma patients, anxiety was related to lower compli- 
ance with daily symptom reports (note that diary entries in this case took place three times 
per day). Although a few studies do report on systematic differences in rates of missing 
data, there is a need for more research on this issue. For a more in-depth discussion of 
missing data issues, see Black, Harel, and Matthews (Chapter 19, this volume). 


Cumulative Effects over Days 


In addition to these methodological questions, researchers will undoubtedly become more
creative in handling daily diary datasets to address our research questions more adequately.
For example, although many studies investigate within-day and cross-day effects
of important psychosocial variables, very few researchers have extended analyses beyond 
the effects of a day or two. One would imagine that a stressor might elicit a different 
response if it followed 3 good days in a row compared to following 3 days of other terrible 
stressors. Grzywacz and Almeida (2008) recently argued that daily process studies have 
neglected the issue of accumulated stress. They operationalized stressor “pileup” as the 
number of days, over the past 3 days, that a stressor was reported. Their research showed 
that the cumulative burden of stressor pileup was associated with an increased risk of 
binge drinking, independent of the stress of that same day. Hamilton and colleagues
(2008) also used an accumulation approach to study the effects of sleep disruption in
women with fibromyalgia. Sleep debt (the number of previous nights in a row the participant
reported less than 6 hours of sleep) was associated with higher and higher levels
of negative affect as the sleep deprivation accumulated. It strikes us that a number of our 
common constructs in daily diary studies might fit a cumulative buildup pattern, and it 
would be interesting to see more researchers testing such effects. 
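
The two accumulation variables described above can be derived from ordinary daily records, as the following Python sketch suggests. The function names and data are hypothetical, and we assume that the current day is excluded from the 3-day lookback for stressor pileup; the original studies may have handled study-start days and missing days differently.

```python
def stressor_pileup(stress_days, day, lookback=3):
    """Number of the previous `lookback` days on which a stressor was reported."""
    start = max(0, day - lookback)
    return sum(stress_days[start:day])


def sleep_debt(hours_slept, night, threshold=6.0):
    """Consecutive nights just before `night` with less than `threshold` hours of sleep."""
    run = 0
    for hours in reversed(hours_slept[:night]):
        if hours >= threshold:
            break
        run += 1
    return run


# Hypothetical records for one participant (index 0 = first study day).
stress_days = [1, 0, 1, 1, 0, 0]              # 1 = stressor reported that day
hours_slept = [7.5, 5.0, 5.5, 4.0, 8.0, 5.0]  # hours of sleep per night

pileup_day4 = stressor_pileup(stress_days, 4)  # counts days 1 to 3
debt_day4 = sleep_debt(hours_slept, 4)         # nights 1 to 3 were all short
```

Either variable can then be entered as a day-level predictor alongside the same-day stressor or sleep report.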


Within-Person Processes as Predictors of Long-Term Outcomes 


In many daily process studies, we look at how people differ in within-person relationships 
between two variables, so the within-person relationship essentially becomes an outcome 
(the dependent variable). However, sometimes within-person processes that we are able 
to capture with diary methods might also be conceptualized as vulnerability factors (pre- 
dictor variables). How individuals react to situations (e.g., within-person slope between 
stress and coping or self-esteem) might render them vulnerable to problematic outcomes 
in the face of larger life stress. Hence, the within-person process is potentially an indepen- 
dent variable that is useful in predicting some other long-term outcome. Some examples 
of this approach are Butler, Hokanson, and Flynn (1994), who showed that self-esteem 
lability (stronger relationship between daily events and self-esteem) was associated with 
an increased risk for depression symptoms, and Cohen and colleagues (2008), who used 
within-person processes as prognostic indicators of poor treatment response in cogni- 
tive therapy for depression. Still, there are few researchers using daily process-oriented 
variables (within-person slopes) as predictors of longer-term outcomes. Of course, this 
work is somewhat dependent on establishing the stability of within-person processes, as 
described earlier. 


Nonlinear Effects 


Most researchers make assumptions about the linearity of effects between within-person 
variables. For example, in the relationship between stress and mood, most researchers 
calculate a linear slope representing each person’s unique relationship between stress 
and mood. However, many psychosocial processes do not follow a linear pattern (Burke, 
Shrout, & Bolger, 2007). It strikes us that many researchers (often including ourselves) 
do not investigate the possibility of nonlinear effects in the associations between within- 
person variables. It is possible, for example, that the effect of interpersonal conflict on 
self-esteem is curvilinear; the first big conflict of a day might damage self-esteem, but each 
successive conflict might not have an equivalent effect. In other words, the person might 
experience a big drop in self-esteem after one or two conflicts, then only small drops in an 
already low score with subsequent conflicts (or certainly the opposite could happen when 
people are resilient to one or two conflicts, but when they have three or four, their self- 
esteem really starts to plummet). Furthermore, the linearity of relationships might differ 
according to individual-difference factors. Some people might have a very linear relation- 
ship between conflict and esteem, and others might follow more of a threshold pattern, 
in which esteem does not change much until a certain level of interpersonal stress is reached. Daily
diary designs are well suited to test these sorts of curvilinear within-person effects and
the between-person differences that moderate such effects (for further ideas on modeling
nonlinear effects over time, see Deboeck, Chapter 24, this volume).

There are certainly a few daily diary studies that have tested for curvilinear patterns
in within-person relationships. The Hamilton and colleagues (2008) study that tested 
the effects of cumulative sleep deprivation also found a curvilinear relationship between 
sleep duration and fatigue. A quadratic function added to the model indicated that those 
with both very long and very short sleep durations had more fatigue, and moderate lev- 
els of sleep were associated with lower fatigue. Also, Burke and colleagues (2007) have 
demonstrated more advanced statistical methods for handling nonlinear models with 
repeated measures data. Nonlinear mixed-model analysis allows researchers to estimate 
average nonlinear trajectories across samples, as well as simultaneously model between-person
differences in these trajectories. This approach might improve our ability to model
complex relationships between our within-person predictors and
outcomes in daily diary data.
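
To illustrate what adding a quadratic term involves, the following Python sketch fits a single-level least squares model with an intercept, a linear term, and a squared term, and then locates the turning point of the curve. The data are hypothetical, and this is a deliberately simplified stand-in: it is not the multilevel model used by Hamilton and colleagues (2008), nor the nonlinear mixed-model approach of Burke and colleagues (2007).

```python
def fit_quadratic(x, y):
    """Least squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    # Build the 3x3 system (X'X) b = X'y for the design columns 1, x, x**2.
    powers = [sum(xi ** p for xi in x) for p in range(5)]
    a = [[powers[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for row in range(col + 1, 3):
            factor = a[row][col] / a[col][col]
            for k in range(col, 3):
                a[row][k] -= factor * a[col][k]
            rhs[row] -= factor * rhs[col]
    # Back substitution.
    b = [0.0, 0.0, 0.0]
    for row in (2, 1, 0):
        tail = sum(a[row][k] * b[k] for k in range(row + 1, 3))
        b[row] = (rhs[row] - tail) / a[row][row]
    return b


# Hypothetical data in which fatigue is lowest near 8 hours of sleep.
sleep = [4, 5, 6, 7, 8, 9, 10, 11, 12]
fatigue = [2.0 + 0.25 * (h - 8) ** 2 for h in sleep]

b0, b1, b2 = fit_quadratic(sleep, fatigue)
turning_point = -b1 / (2 * b2)  # sleep duration at which predicted fatigue is lowest
```

A positive coefficient on the squared term corresponds to the U-shaped pattern described above, with more fatigue at both very short and very long sleep durations.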


Summary and Conclusion 


Daily diary designs are emerging in many different research areas, spanning a range of 
topics in social, clinical, cognitive, health, and industrial/organizational psychology. So 
many of our behaviors, situations, cognitions, emotions, and reactions change from day 
to day, and these designs allow us to model some of the more micro-level processes that 
play out as we navigate our everyday lives. With increasing numbers of researchers using 
daily diary methods, we will continue to see new technological, methodological, and 
statistical advances that make our designs even more powerful and flexible. Importantly, 
though, we also need to think more flexibly about how to use the very complex data 
that result from repeated daily assessment. We have made great progress in our use of 
daily diary methods and data, but there is also much room for continued creativity in the 
design and analysis of daily diary studies.


CHAPTER 9


Event-Contingent Recording 


D. S. MOSKOWITZ 
GENTIANA SADIKAJ 


There is considerable variation in the duration and the discriminability of features of an
individual and the individual’s environment in everyday life. Some features of potential
interest, such as smoking a cigarette or a social encounter, have discrete beginnings
and endings. Other features have beginnings and endings that are less discernible, such 
as a bad mood. Durations of features also vary: The cigarette takes a brief period of time 
to smoke, a matter of minutes, while a bad mood can linger for hours. Understanding the 
nature of a feature, at least with respect to the discernibility of the beginning, is crucial 
for making design decisions in studies using intensive repeated measures in naturalistic 
settings (IRM-NS). 

Three primary designs have been used for the self-report collection of IRM-NS data:
interval-contingent designs, which provide for the recording of data for specific units of 
time, such as a day; signal-contingent designs, which specify data recording at the occur- 
rence of signals that typically occur randomly throughout a day; and event-contingent 
recording (ECR), in which data recording is completed subsequent to the occurrence of a 
specified event (Wheeler & Reis, 1991; see also Conner & Lehman, Chapter 5, this vol- 
ume). The focus of this chapter is on ECR, which requires the identification of events with 
a discernible beginning and end. The explicit identification of events of interest permits 
the maximization of the collection of data about these focal events, and the minimiza- 
tion of the amount of time between the event and the recording of information about the 
event. This method has been found to be particularly suitable for data collection when 
the focal events are social interactions, and for health-related behaviors, such as smoking 
and sudden exacerbations in symptoms (e.g., acute pain). 

This chapter begins with a brief history of ECR and a description of some common 
forms of ECR. We then provide suggestions for improved practices when using this form 
of intensive repeated measure self-report methodology. We subsequently discuss some 
advantages and disadvantages of the ECR design and consider some issues relevant to 
the reliability of ECR. We conclude with some speculation about how the ECR method 
may evolve.


History of ECR


The use of IRM-NS designs dates back to the beginning of the 20th century with the 
time budget (i.e., time allocation) studies in sociological research (see Bevans, 1913). 
Burns (1954) was one of the first to use an ECR method. He examined the distribution 
of communication time of individuals at the executive level of industrial organizations. 
The event was defined by the membership composition and content of communication 
exchanged between people interacting with each other. 

ECR was later employed as part of behavior modification techniques whose aim 
was to facilitate the tracking of problem behaviors, such as smoking and marital con- 
flict, and their associated contextual features (see Suls & Martin, 1993; Wheeler & 
Reis, 1991). For example, Duncan (1969) used self-monitoring to count the frequency of 
problem behaviors (e.g., food intake) in a group of adolescents. Nelson (1977) described 
techniques for counting targeted problem behaviors, such as the number of cigarettes 
smoked, paranoid thoughts, and “keeping one’s temper,” in response to “provocations 
to argue.” While these self-monitoring procedures shared the feature of defining discrete 
events of interest, this work was different from more recent event-contingent methods in 
the substantial reliance on procedures tailored to the individual patient. First, an assessment
would be conducted to identify the target problem behavior; then, frequency records
of the behaviors targeted for that individual would be kept in relation to treatment. More
recently, standardized forms have been developed as part of some empirically supported 
treatment programs (Korotitsch & Nelson-Gray, 1999). Self-monitoring
approaches have focused on particular patients rather than on generalizing findings across
patients.

ECR has also been used within the context of cognitive psychotherapy. Beck, Rush, 
Shaw, and Emery (1979) developed a technique in which patients were trained to recog- 
nize negative automatic thoughts, and to count the frequency of these negative thoughts 
with the aid of a wrist monitor. With forms such as the Daily Record of Dysfunctional 
Thoughts, patients recorded in vivo information associated with an unpleasant emotion, 
such as the intensity of specific unpleasant emotions, the situation leading to the unpleas- 
ant emotion, the specific automatic thoughts leading to the emotion, the plausibility of 
these thoughts, and alternative interpretations of the event. This information was used 
in therapy to teach patients about the kinds of situations and dysfunctional thoughts that 
maintained their depression and anxiety. These early techniques shared with behavior 
therapy a focus on customizing the record form for a particular individual, and an absence 
of functionality for providing information that could be generalized across persons. 

More recent studies have used standardized forms in ECR studies of individuals 
with psychopathology. For example, event-contingent methods have been used success- 
fully to study the interpersonal functioning of adults with social anxiety disorder, adults 
with borderline personality disorder, adolescents at risk for bipolar disorder, and both 
adolescents and young adults with a history of self-injurious thoughts and behaviors 
(Linnen, aan het Rot, Ellenbogen, & Young, 2009; Nock, Prinstein, & Sterba, 2009; 
Russell, Moskowitz, Paris, Sookman, & Zuroff, 2007; Sadikaj, Russell, Moskowitz, & 
Paris, 2010). 

ECR methods have also been used in studies related to behavioral medicine in which 
the symptoms are sudden and have acute onset. For example, the feasibility of ECR has 
been demonstrated in studies of binge eating among individuals with eating disorders
(Stein & Corte, 2003) and studies of cigarette and alcohol consumption (Cooney et al.,
2007; Moghaddam & Ferguson, 2007; Todd et al., 2005). 

There are a few examples of the use of ECR and other intensive repeated measurement
methods in industrial/organizational settings. ECR has been used to study the display
of emotion at work (Tschan, Rochat, & Zapf, 2005) and the effects of shifting
roles on the emotions and behaviors of individuals in organizations (Côté & Moskowitz,
2002; Moskowitz, Suh, & Desaulniers, 1994). Beal and Weiss (2003) suggested that ECR
is also suitable for studying low-frequency events within an organizational context. For 
example, research participants in a work setting could be trained to report on instances 
of altruistic behavior or instances of procedural justice and injustice in an organization. 

A common application of ECR has been in the study of social interaction. ECR has 
been used to study behavior, affect, perceptions, and appraisals in adults of varying ages. 
Sometimes the nature of the social interaction is left unspecified, and other times the 
focus is on a specific kind of interpersonal event, such as lying, feeling blamed by another, 
and marital conflict (DePaulo, Kashy, Kirkendol, Wyer, & Epstein, 1996; Papp, Cum- 
mings, & Goeke-Morey, 2009; Parkinson & Illingworth, 2009). 

In summary, while other applications are possible, ECR has been used primarily in 
three domains: clinical symptoms of psychopathology, health problem behaviors, and 
social interactions. The next section focuses on the use of ECR to study social inter- 
actions and describes in more detail two specific measurement strategies for collecting 
event-contingent data within this domain. 


Two Approaches to the ECR of Social Interactions 


Rochester Interaction Record 


A prominent ECR procedure developed by Wheeler and Nezlek (1977) is referred to as 
the Rochester Interaction Record (RIR). This widely used instrument is intended to mea- 
sure the nature and extent of participation in social life. Records of social interactions are 
typically collected for 1-2 weeks about every social interaction lasting at least 10 min- 
utes. A standard set of questions requests that the participant provide information about 
the social interaction. Items include questions about the duration of the interaction and
the number and gender of the other persons in the interaction, as well as ratings of dimensions
such as intimacy, self-disclosure, other disclosure, group inclusion, pleasantness, satisfaction,
extent of initiation, and extent of influence.

The stability of RIR scores was examined by Reis (1989; cited in Reis & Wheeler, 
1991). Scores obtained during a 2-week period were grouped into those obtained on odd 
and on even days, and then aggregated. Stability was high for the questions about dura- 
tion and other participants, and for the rating scales. 
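
The odd-even procedure described above can be sketched as follows in Python. The participant identifiers and ratings are hypothetical, each value is assumed to be a person's mean rating for one study day, and no correction for the halved number of days is applied; the original analysis may have differed in these details.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def odd_even_means(daily_scores):
    """Average a person's scores separately over odd and even study days."""
    odd = daily_scores[0::2]   # days 1, 3, 5, ...
    even = daily_scores[1::2]  # days 2, 4, 6, ...
    return sum(odd) / len(odd), sum(even) / len(even)


# Hypothetical daily mean intimacy ratings for four participants over 6 days.
ratings = {
    "p01": [4, 5, 4, 5, 4, 5],
    "p02": [2, 2, 3, 2, 2, 3],
    "p03": [6, 6, 5, 6, 6, 6],
    "p04": [3, 4, 3, 3, 4, 3],
}
halves = [odd_even_means(days) for days in ratings.values()]
stability = pearson_r([h[0] for h in halves], [h[1] for h in halves])
```

A high correlation between the odd-day and even-day aggregates indicates that people keep their relative standing across the two halves of the recording period.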

The validity of the RIR scale indices has also been examined. Substantial agreement 
has been found between college roommate pairs (Hodgins & Zuckerman, 1990) on the 
number and duration of interactions. Validity has been further examined by comparing 
scores based on the RIR rating scales to questions of similar content using a single-occasion
assessment procedure (Reis & Wheeler, 1991). Most of these correlations were
modest in size, indicating some convergence between the methods and also some divergence,
as would be expected given differences in the methodologies.

The original RIR and its modifications have produced a wide range of findings that
illuminate influences on social participation and the quality of social interactions. For 
example, the method has been useful in the study of developmental trends in social par- 
ticipation, such as the effects of the transition to college, changes from college to the 
early adult years, and the effects of courtship progression on kinds of socializing (see 
Reis & Wheeler, 1991). The method has also been effective at demonstrating the impact 
of variables such as motivation, attractiveness, and cultural differences on the quality 
of social interactions and relationship satisfaction (e.g., Downie, Mageau, & Koestner, 
2008; Nezlek, 2000; Nezlek, Schütz, Schröder-Abe, & Smith, 2011). The RIR has been 
adapted to examine a variety of variables depending on the dimensions of interest. One 
variation is the Deception Record, in which the target event is telling a lie, and informa- 
tion is collected about the seriousness of and reason for the lie, and the target’s reaction 
to the lie (DePaulo et al., 1996). 


Social Behavior Inventory 


The Social Behavior Inventory (SBI) procedure developed by Moskowitz (1994) is an 
intellectual descendant of the RIR, with characteristics that distinguish it from RIR pro- 
cedures. It shares with the RIR a focus on social interactions and the collection of infor- 
mation about features of both the situation and the individual in the event. It is distinct 
from the RIR in a more explicit focus on specific kinds of behavior. 

The scales of the SBI were derived from the interpersonal circle model of interper- 
sonal traits (Kiesler, 1983). Item development focused on the four poles identified as 
primary by Wiggins (1991), which represent the agentic and communal dimensions of 
interpersonal interaction. The four behavior scales assess the degree of dominant, sub- 
missive, agreeable, and quarrelsome behaviors. The items were written to be relevant 
to situations occurring in a variety of settings, such as work and home. Examples of 
items for each of the behavior scales include “I set goal(s) for the other(s) or for us,” 
representing dominant behavior; “I waited for the other person to act or talk first,” rep- 
resenting submissive behavior; “I expressed reassurance,” representing agreeable behav- 
ior; and “I made a sarcastic comment,” representing quarrelsome behavior. To avoid the 
development of response sets by participants in the endorsement of items, multiple forms
have been developed and are rotated throughout a study. Sample forms for an ECR SBI
study are included in Appendix 9.1. The assignment of items to scales can be found in 
Moskowitz (1994). 

The items were initially developed by creating a pool of items based on a review of 
prior measures of dominance, submissiveness, agreeableness, and quarrelsomeness. As 
prior scales had often been used primarily with samples of university students, additional 
items relevant to work settings were generated in interviews with the staff at a large 
telecommunication firm. Potential items were stripped of any information referring to 
situations, so that the behavior items could be examined with respect to the situational 
features recorded concurrently with the behavior items. Items also referenced behaviors 
that were neutral or slightly positive with respect to social desirability, so people would 
feel comfortable endorsing the items on a regular basis during daily life. Two criteria were 
used to select items from the initial pool for the final scales: (1) ratings by experts and 
others of item similarity to the definition of the scale, and (2) correlations of items with 
the Revised Interpersonal Adjective Scales (IAS-R; Wiggins, Trapnell, & Phillips, 1988), 
a carefully constructed measure of interpersonal circumplex traits. 

Studies using an ECR procedure that focuses on social interactions lasting 5 minutes or
longer have found considerable evidence supporting the reliability and validity
of the SBI behavior scales (Brown & Moskowitz, 1998; Côté & Moskowitz, 1998; 
Moskowitz, 1994; Moskowitz & Côté, 1995; Moskowitz et al., 1994). There is high 
internal consistency among the items on each scale. The scales have high stability when 
scores are aggregated over a 12-day period. The pattern of correlations among the behav- 
ior scales corresponds to structural predictions based on the interpersonal circumplex. 
The behavior scales have demonstrated convergent validity and discriminant validity 
with one-occasion questionnaire measures. The scales are sensitive to predicted changes 
due to different situations, such as the effect of social role status on dominant and sub- 
missive behavior. 

There has been a range of research with the SBI. Issues examined with the SBI include 
the effects of gender and context on interpersonal behavior; the extent of variability in 
the interpersonal behavior of individuals with psychopathology such as borderline per- 
sonality disorder; and the effects of psychopathology, personality traits, and context on 
behavioral reactivity to interpersonal cues. The method has also been used to examine 
the within-person relation between interpersonal behavior and affect, and moderators of 
those relations. In addition, the method has been effective at detecting the effects of psy- 
chopharmacological interventions on interpersonal behavior in daily life. For overviews 
of findings with the measure, see Moskowitz (2009, 2010). 


Ten Tips for Using ECR Procedures 


1. When using ECR, an event should have a conceptual end, as well as a beginning, 
and data collection about the event should represent the whole event. A sample subsec- 
tion of an event, as might be obtained with random signals, may not tell a complete 
story. For example, in the study of power in a social interaction, it may not be sufficient 
to know how powerfully the person is behaving right now, as would be obtained in 
signal-contingent recording. Rather, it may be more important to know how powerful 
the person perceived him- or herself or the other person to be at the conclusion of the 
interpersonal interaction. 


2. Provide a specific description of the requirements of the study when making first 
contact with participants. Realistic information from the outset provides better-informed 
participants who are less likely to drop out after the study has begun, thereby reducing 
attrition. The description of the responsibilities of participants can be given verbally in an 
initial telephone conversation. In addition, the description of participation requirements 
can be provided on a website designed for potential participants. 


3. Be specific in the definition of the event. Have a clear understanding of when the 
event begins and ends. This definition must be understandable by the participants. 


4. Invest in the quality of the induction procedure for educating participants about 
the study’s procedures. Provide detailed and clear instructions to participants. Our stud- 
ies typically use an induction procedure that requires about 45 minutes. The induction 
guides participants through the procedure and provides guided practice in using the
procedure. In a typical presentation, we educate participants about the definition of an event, 
discuss examples of events that might or might not be social interactions, and provide 
clarification about transition points between social interactions. Appendix 9.2 provides 
our instructions concerning “What is a social interaction?” In addition, we guide par- 
ticipants through completing a sample ECR form, which affords us the opportunity to 
explain the meaning of various questions. 


5. Encourage participants to feel that their contributions to the study and to science 
are important, and give attention to participants. Contact participants after a few days in 
the study to answer their questions or to review any procedures about which they might 
have concerns, and subsequently contact participants periodically during the study, either 
by e-mail or by phone. Develop procedures to respond promptly to a participant who 
attempts to contact the researchers. This attitude and these procedures help to sustain 
participants’ motivation through the lengthy procedure. 


6. Develop a procedure for checking participants’ compliance with the method. For 
example, if paper-and-pencil forms are used, they should be mailed daily and checked 
upon receipt for the accuracy of completion. Follow up immediately with the participant 
when inaccuracies and drifting from the procedure are noted. 


7. When measuring multiple kinds of variables, such as behavior and affect, con- 
sider using items with different response formats to avoid the confounding influence of 
method variance when examining the relations among different kinds of variables. More- 
over, we have found it useful to construct several versions of the ECR forms; the inclusion 
of different items in these versions can be a safeguard against any response sets that may 
develop during the course of the study. 


8. Whether using paper or an electronic presentation (e.g., handheld computer or 
smartphone), the presentation of question stimuli should be in a clear font that is suf- 
ficiently large to be easily read by members of the sample to be recruited. Larger fonts 
should be used when working with an older population. When using an electronic device, 
the screen must be sufficiently bright to be easily read by members of the designated 
population, and the screen must be sufficiently large that the whole question fits comfort- 
ably on the screen given the necessary font size. Size of screen is especially a concern when 
using questions that require more than one or a few words. For example, the description 
of a kind of behavior requires a longer question stem than the presentation of an emotion 
adjective. 


9. Pilot-test the procedure with individuals recruited from the designated popula- 
tion to uncover potential problems in the data collection procedure. For example, when 
using an electronic device, pilot participants can try to anticipate ways the device might 
be confusing to participants, such as ways participants could accidentally leave the data 
collection application. 


10. As with any measurement strategy, careful attention should be paid to selecting 
and defining the variables of interest. Why measure these variables? Do the questions 
identify an exhaustive list of items relevant to the variables of interest, or are the selected 
variables sampled from a larger domain? Similar questions can be asked about the events 
to be recorded. Do the events represent an exhaustive identification of pertinent events? 
If so, is the list of events too long for participants to retain in memory? Should there be 
some sampling from the designated domain of relevant events? 


Advantages and Disadvantages of ECR 


When considering the use of an ECR approach to data collection, it is important to con- 
sider the relative appropriateness of the two other major designs of interval-contingent 
recording and signal-contingent recording. ECR designs are most appropriate for record- 
ing information about phenomena and the associated situational cues that occur in events 
with a well-articulated beginning and end. If an interval recording design, such as an
end-of-day procedure, were used to record information about such events, multiple focal events 
might occur during the specified interval. Participants could rely on memory and record 
as many focal events as they could remember at the end of the day. This would decrease 
the immediacy of recording, a desired hallmark of designs using intensive repeated mea- 
surements in naturalistic settings. In addition, completion of data records about multiple 
events at the same time could be a source of reactivity in the completion of data, either as 
the participant strives for the appearance of consistency and creates more similar records 
than actually occurred in the events, or strives for distinctiveness among records for dif- 
ferent events. 

When data collection is contingent on random signals, relevant events may be lost. 
Moreover, the less often an event occurs, the less likely an event and a signal will coin- 
cide. The signal-contingent method may be modified for events of interest, by instructing 
the person to record the last relevant event prior to the signal. Such an adaptation reduces 
the immediacy of recording. Also, if two events occurred prior to the signal, then one of 
the events would be lost. Instructions could be provided for the completion of records 
for multiple events at the time of the signal. However, completion of record forms for 
multiple events at the same time has the potential to distort data relative to the distinct 
events. 
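
The point that rare events are seldom caught by random signals can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the studies cited here; it assumes that signals are independent and uniformly distributed over waking time, and that the event of interest occupies a fixed fraction of that time. The function name and the example values are hypothetical.

```python
def prob_signal_hits_event(event_fraction: float, n_signals: int) -> float:
    """Probability that at least one of n random signals falls inside the event.

    Assumptions: signals are independent and uniform over waking time, and
    the event occupies `event_fraction` of that time (between 0 and 1).
    """
    if not 0.0 <= event_fraction <= 1.0:
        raise ValueError("event_fraction must be between 0 and 1")
    if n_signals < 0:
        raise ValueError("n_signals must be non-negative")
    # Complement of the probability that every signal misses the event.
    return 1.0 - (1.0 - event_fraction) ** n_signals


# Eight signals per day, for events taking up 20%, 5%, and 1% of waking time.
for fraction in (0.20, 0.05, 0.01):
    print(fraction, round(prob_signal_hits_event(fraction, n_signals=8), 3))
```

Under these assumptions the chance of a coincidence falls off quickly as the event becomes rarer, which is the situation in which an event-contingent design is most useful.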

Eckenrode and Bolger (1995) illustrated these advantages and disadvantages in their 
discussion of the appropriateness of different designs for the study of stress. Intensive 
repeated self-report studies of stress often use a daily interval-contingent recording proce- 
dure for collecting information regarding stress processes. An investigator might choose 
to use a daily interval recording method to be consistent with and have results that are 
easily integrated with prior studies. Yet it is possible that shorter recording units might be 
desirable. The effect of a stressful event might dissipate by the end of the day, when the record 
would be collected for a study relying on daily records. An alternative design might use 
signal-contingent recording, in which there are multiple recordings each day. However, it 
is possible for a signal not to occur close in time to a stressful event, thereby reducing the 
immediacy of the recording of the reaction to the stressful event. An ECR design in which 
participants are instructed to complete a record form contingent upon the occurrence of 
a stressful event would have the advantage of immediacy in the measurement, with the 
consequent reduction of retrospection, which is a frequent goal of intensive repeated self- 
report measures. 

The practicality of the ECR design is dependent on the definition of a sufficiently 
small number of event categories, so that participants can remember to record these 
events when they occur. Moreover, categories need to be discrete; events such as exces- 
sive workload on the job or family demands do not usually have a sufficiently discernible 
beginning or end to provide participants with a trigger to complete a recording form. 
Interval-contingent or signal-contingent designs would probably better characterize these 
diffuse stressors. However, it may be possible to identify a small number of high-stress 
categories, such as interpersonal conflicts, that are both important and sufficiently
discrete for use in an ECR design to characterize stress processes. 

As participants are trained to focus on the details of particular events, some research- 
ers have considered it possible that ECR methods may engender more reactivity than 
other forms of intensive repeated measurements. For example, Merrilees, Goeke-Morey, 
and Cummings (2008) conducted a study with husbands and wives who participated in 
a 2-week ECR procedure that focused on the emotions and conflict behaviors of them- 
selves, their spouses, and their children during marital conflict episodes. The participants 
did not manifest any changes in behavior and affect in a laboratory study after having 
participated in the ECR study. However, the husbands did report decreased marital qual- 
ity over the course of the study. Although this study does not demonstrate that ECR 
produces greater reactivity to negative cues than other IRM-NS methods, the finding 
concerning reduced marital quality does raise the issue of measurement reactivity, dis- 
cussed by Barta, Tennen, and Litt (Chapter 6, this volume). While the extent of reactivity 
and the details of methods that may produce reactivity need to be explored further, it 
seems plausible that a method that only increases people’s attention to negative aspects 
of an event may produce increasing sensitivity to negative aspects of a social interac- 
tion. Investigators should consider constructing forms with balanced questions that give 
respondents the possibility to report on positive, as well as negative, aspects of the event. 
For example, response forms should be designed to have a similar number of positive and 
negative emotions. In the SBI, participants can report about both positive behaviors, such 
as agreeableness, and negative behaviors, such as quarrelsomeness. 


Examining Reliability in ECR Designs 


A difficult issue that arises in all IRM-NS designs is specifying the length of the sampling 
period or, in the case of ECR, specifying the length of time necessary to have a sufficient 
number of events for replicable measurements and analyses. Different researchers have 
used various durations for sampling periods. The number of events necessary for reliable 
results depends on the kind of variable measured and the kind of analysis conducted. 
When conducting analyses of the within-person relation between two variables, as a rule 
of thumb we have tried to collect at least 30 events for each kind of situation for which 
we plan to conduct analyses. However, when constructing profiles of individuals’ mean 
behaviors in each of a set of situations, we have found stable profiles using fewer events 
(Fournier, Moskowitz, & Zuroff, 2008). More studies are needed that explicitly examine 
the reliability of ECR scores to provide guidelines about the duration of the sampling 
period needed for a particular kind of construct and kind of analysis. 

It is possible to obtain more specific guidance for obtaining reliable data by applying 
generalizability theory (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam, 1972; 
Shavelson & Webb, 1991) to examine the multiple sources of variance that influence a 
given score. Cranford and associates (2006) demonstrated the applicability of generaliz- 
ability theory to a daily interval-contingent recording study of affect, using a crossed 
design in which all participants completed records for the same days (also see Shrout 
& Lane, Chapter 17, this volume). While space limitations preclude presentation of the 
relevant formulas, it is also possible to apply generalizability theory to ECR designs in 
which records are nested within persons (see Brennan, 2001, Chapter 3; Shavelson & 
Webb, 1991, Chapter 3). Variance decomposition of the scores can be obtained from a 
multilevel intercept-only model in which item scores are the dependent variable (Hox & 
Maas, 2006). Models can be estimated using PROC MIXED (SAS Institute, 2007). 
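
As a rough illustration of this kind of variance decomposition, the Python sketch below splits scores into between-person and within-person (event-level) variance for events nested within persons. It is a simplification under stated assumptions: it uses the method-of-moments estimators from a one-way random-effects ANOVA rather than the multilevel model estimated with PROC MIXED, it treats each event as having a single score (so the item facet is ignored), and the function name and example data are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def variance_components(scores: Dict[str, List[float]]) -> Tuple[float, float]:
    """Estimate (between-person, within-person) variance for events nested in persons.

    Uses one-way random-effects ANOVA estimators (method of moments), which is
    a swapped-in technique, not the multilevel model described in the text.
    """
    persons = [p for p in scores if len(scores[p]) > 0]
    n_total = sum(len(scores[p]) for p in persons)
    n_persons = len(persons)
    if n_persons < 2 or n_total <= n_persons:
        raise ValueError("need at least two persons and more events than persons")

    grand_mean = sum(sum(scores[p]) for p in persons) / n_total
    ss_within = sum((x - mean(scores[p])) ** 2 for p in persons for x in scores[p])
    ss_between = sum(
        len(scores[p]) * (mean(scores[p]) - grand_mean) ** 2 for p in persons
    )
    ms_within = ss_within / (n_total - n_persons)
    ms_between = ss_between / (n_persons - 1)

    # Effective number of events per person; allows unequal numbers of events.
    n0 = (n_total - sum(len(scores[p]) ** 2 for p in persons) / n_total) / (n_persons - 1)
    var_person = max(0.0, (ms_between - ms_within) / n0)
    return var_person, ms_within


# Hypothetical unpleasant-affect scores for two participants, two events each.
example = {"A": [1.0, 3.0], "B": [5.0, 7.0]}
var_person, var_within = variance_components(example)
single_event_coefficient = var_person / (var_person + var_within)
print(var_person, var_within, round(single_event_coefficient, 3))
```

The ratio computed at the end is the share of score variance attributable to stable differences between persons, which corresponds to the generalizability of a score from a single randomly selected event in this simplified one-facet design.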

In generalizability analyses of unpleasant affect, we have found that the generaliz- 
ability coefficient for a random event is not large (e.g., .25 in one study), indicating that a 
person’s unpleasant affect in any randomly selected event is not likely to be strongly related 
to this person’s unpleasant affect in another randomly selected event. In other words, if we 
were to use scores from only one event, these scores would not generalize across all events. 
In contrast, when we have examined the generalizability of event scores aggregated over a 
large number of events (e.g., 80 events), we have found the generalizability of affect scores 
to be quite high (e.g., .96). It is noteworthy that the formula for aggregated event scores 
(Cranford et al., 2006, p. 925, Formula 4) can be used to estimate how many events we 
need from each participant to reach a desired level of reliability. Thus, it is possible to use 
generalizability theory formulas to vary the number of events and estimate the number of 
events necessary for a specified level of generalizability across events. 
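
The relation between single-event and aggregated generalizability can be sketched with a Spearman-Brown-type projection. This is a one-facet simplification assumed for illustration (events nested within persons, item facet ignored); it is not a reproduction of Formula 4 in Cranford et al. (2006), and the function names are hypothetical. With a single-event coefficient of .25 and 80 events it gives approximately .96, in line with the values reported above.

```python
import math


def aggregated_generalizability(single_event: float, n_events: int) -> float:
    """Project the coefficient for a mean over n_events from the single-event value.

    Spearman-Brown form: k * g / (1 + (k - 1) * g).
    """
    if not 0.0 < single_event < 1.0:
        raise ValueError("single_event must be strictly between 0 and 1")
    if n_events < 1:
        raise ValueError("n_events must be at least 1")
    return n_events * single_event / (1.0 + (n_events - 1) * single_event)


def events_needed(single_event: float, target: float) -> int:
    """Smallest number of events whose projected coefficient reaches `target`."""
    if not (0.0 < single_event < 1.0 and 0.0 < target < 1.0):
        raise ValueError("both coefficients must be strictly between 0 and 1")
    exact = target * (1.0 - single_event) / (single_event * (1.0 - target))
    # The small tolerance guards against floating-point error at exact integers.
    return max(1, math.ceil(exact - 1e-9))


print(round(aggregated_generalizability(0.25, 80), 3))
print(events_needed(0.25, 0.80), events_needed(0.25, 0.92))
```

Inverting the projection in this way shows how a target level of reliability translates into a minimum number of events per participant, which can then inform the length of the sampling period.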

There is also a generalizability estimate that corresponds to the internal consistency 
coefficient in classical test theory. It refers to the expected consistency of the item scores 
in a random event. It is important to note that to calculate this analogue to internal 
consistency, it is necessary to have multiple items. Studies that use single-item measures 
cannot calculate generalizability of items to the universe of items that could be used to 
measure a construct. 

Furthermore, there is a coefficient specific to generalizability theory that cannot be 
calculated within classical test theory. This generalizability coefficient (Cranford et al., 
2006; Shrout & Lane, Chapter 17, this volume) is associated with the reliability of the 
person × event interaction and has been referred to as indicating the reliability of change. 
This coefficient indicates consistency in the within-person variation of a person’s scores 
across events. In other words, the reliability of change indicates the extent of consistency 
in the profile of scores across different events. 
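
For orientation, one commonly cited form of this coefficient is shown below. It is offered as a sketch under assumptions rather than as the formula used in this chapter: it follows the form usually given for a crossed person × day design with multiple items, with "event" substituted here for "day", and the symbols are chosen for illustration. In a design with events nested within persons, the person × event component cannot be separated from event-to-event variance within persons.

```latex
% Reliability of change (sketch): m is the number of items per record,
% \sigma^2_{p \times e} is the person-by-event variance component, and
% \sigma^2_{\varepsilon} is the residual (error) variance.
R_C = \frac{\sigma^2_{p \times e}}{\sigma^2_{p \times e} + \sigma^2_{\varepsilon} / m}
```

Because the error variance is divided by the number of items, this coefficient can only be estimated, and improved, when each record contains more than one item for the construct.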

In summary, it is possible to use generalizability theory to estimate the reliability of
scores generated from ECR measures in a way that takes into consideration their unique
characteristics, most importantly the nesting of events within individuals. Generalizability theory 
can be used to evaluate the number of events and the number of items necessary to obtain 
specified levels of reliability for ECR designs. For example, in our work, we have found 
low generalizability of affect scores based on a single event. Such findings suggest caution 
when using measures of affect based on a single event, and support the utility of methods 
that provide data about multiple events for providing reliable data. It is important that 
more studies be conducted on the generalizability of scores based on intensive repeated 
measurements to accumulate a knowledge base of the data necessary for obtaining reli- 
able scores in a variety of domains. 


Recent and Future Developments in ECR 


One recent trend is the collection of “passive” recordings of aspects of the individual 
or the environment concurrent with ECR data. For example, Sakamoto and associates 
(2008) collected recordings of locomotion along with event-contingent reports of panic 
attacks. Aan het Rot, Moskowitz, and Young (2008) collected recordings of exposure to 
sunlight concurrent with event-contingent reports of affect and social behavior. These 
kinds of data complement each other. Event-contingent information helps to provide psy- 
chological meaning to the “mechanical” recording of information, such as changes in 
locomotor activity, while the passively collected information enriches and provides con- 
text for understanding of the self-reported information. 

Currently, ECR refers to the completion of data collection forms based on the par- 
ticipant’s judgment that a relevant event has occurred. A variation on this technique, 
referred to as context-triggered recording, designates that record forms are to be com- 
pleted contingent upon a signal that occurs when sensors define the occurrence of a perti- 
nent event (Intille, Chapter 15, this volume). For example, participants could be asked to 
complete a record form when ambulatory monitoring of a physiological response signals 
that a designated indicator has reached a specified level, such as heart rate exceeding 150 
beats per minute. In this case, the beginning of the event would be discernible by the 
physiological measure. An end to the event could also be defined (e.g., heart rate returns 
to participant’s baseline), and a record form could be completed at the end of the event. 
If this procedure were found to be burdensome to the participant, then a beginning and 
end of the physiological event could be defined, and the participant could be signaled to 
complete a record form after the end of the event. It should be possible to define a variety 
of physiological contexts that constitute events about which participants could be sig- 
naled to report. 
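
The logic of such a context-triggered design can be sketched as follows. This is a minimal illustration, not an implementation from any study cited here: it assumes that heart rate arrives as a simple sequence of readings, that an event begins when a reading exceeds the threshold and ends when a reading returns to or below the participant's baseline, and that the participant is prompted once the event has ended. The function name and the baseline value are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def detect_physiological_events(
    heart_rate: Sequence[float],
    threshold: float = 150.0,  # beats per minute, as in the example in the text
    baseline: float = 80.0,    # hypothetical participant-specific baseline
) -> List[Tuple[int, int]]:
    """Return (start_index, end_index) pairs for completed events.

    An event starts at the first reading above `threshold` and ends at the
    first later reading at or below `baseline`. An event that is still in
    progress when the readings run out is not reported.
    """
    events: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, bpm in enumerate(heart_rate):
        if start is None:
            if bpm > threshold:
                start = i
        elif bpm <= baseline:
            events.append((start, i))
            start = None
    return events


readings = [72, 118, 155, 161, 140, 95, 78, 75, 152, 101, 79]
for start, end in detect_physiological_events(readings):
    print(f"event from reading {start} to {end}: prompt participant to complete a record form")
```

Using separate start and end criteria, rather than a single threshold, keeps a brief dip below 150 beats per minute from splitting one episode into several events and several prompts.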

As previously described, a social interaction has often been used as the event to trig- 
ger a recording. An issue for this method is the determination of the proportion of desig- 
nated events for which the participant completes record forms. Unlike signal-contingent 
recording, there is no objective measure of the number of records that should have been 
completed, and when the record forms should have been completed. It might be possible 
to construct an idiographic standard for each participant by using audio recordings. Mehl 
and Pennebaker (2003) pioneered the use of random audio recordings of a person’s life, 
during which a person carries an audio recording device for a specified period. It is possi- 
ble to imagine a variation of this design that uses a voice-activated recorder to identify the 
beginning and end of a social interaction, permitting the calculation of the frequency and 
duration of social interactions to compare to a participant’s records. In another variation 
of this procedure, a device could be programmed to evaluate whether the duration of the 
social interaction is sufficient to signal the participant to complete a record form about 
a social interaction. There will still be errors, such as the voice-activated device detect- 
ing random voices as social interactions. It will be up to future research to determine the 
accuracy of the mechanical specification of an event’s beginning and end. 
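
If interactions could be detected in this way, the proportion of qualifying interactions for which a participant completed a record could be estimated along the following lines. The sketch makes several assumptions that go beyond the text: detected interactions are already available as (start, end) times in minutes, a record counts as matching an interaction if it was completed within a fixed window after the interaction ended, and each record is matched to at most one interaction. The function name, the 30-minute window, and the example data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def compliance_rate(
    interactions: Sequence[Tuple[float, float]],
    record_times: Sequence[float],
    min_minutes: float = 5.0,  # minimum duration of a social interaction
    window: float = 30.0,      # hypothetical time allowed to complete a record
) -> Optional[float]:
    """Proportion of qualifying interactions followed by a record.

    Returns None when no detected interaction meets the minimum duration.
    """
    targets = sorted((s, e) for s, e in interactions if e - s >= min_minutes)
    if not targets:
        return None
    unused: List[float] = sorted(record_times)
    matched = 0
    for _, end in targets:
        for t in unused:
            if end <= t <= end + window:
                unused.remove(t)  # each record is counted only once
                matched += 1
                break
    return matched / len(targets)


# Detected interactions (minutes since waking) and times of completed records.
detected = [(0, 12), (20, 22), (60, 75), (100, 108)]
records = [15, 80]
print(compliance_rate(detected, records))
```

A rate computed this way would inherit any errors in the mechanical detection of interactions, so it is better read as an approximate, participant-specific benchmark than as an exact measure of compliance.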


Conclusion 


There are phenomena identified by a discrete beginning that can be assessed with greater 
specificity by ECR than by interval- or signal-contingent methods. Prominent among 
these phenomena are physical and psychological symptoms, as well as interpersonal 
interactions. The ECR records can provide information about both the phenomena of 
interest and the contextual cues present in these events. As events are nested within a 
person, ECR raises new issues with respect to the estimation of the reliability of mea- 
sures. Generalizability theory provides the opportunity to assess the reliability of specific, 
nested events. Currently, the decision about whether an event has occurred is made by the 
participant, raising the issue that there is no independent verification of the occurrence 
of relevant events. Future developments may provide opportunities for the occurrence of 
a relevant event to be “mechanically” determined and signaled to a research participant, 
thereby providing context-triggered recording about events. 


APPENDIX 9.1. Four Forms Using the Social Behavior Inventory (SBI) for ECR 


COMPLETE THIS FORM AS SOON AS POSSIBLE FOLLOWING A SOCIAL INTERACTION. 


Time of interaction? ______ am/pm     Length of the interaction: ______ 


Briefly describe the social interaction: ______ 


Where did the interaction occur?   home   work   other 


Who was present? Please CIRCLE all those that apply 


M   F   SUPERVISOR   CO-WORKER   SUPERVISEE   CASUAL ACQUAINT   FRIEND   ROMANTIC PARTNER   PARENT   OTHER 


Indicate the initials of the primary person: ______ 


If more than one other person was present, check here: ______ 


Did you do any of the following acts? Fill in the brackets beside each act you did. 


(SBI questions for Form 1) 


1. I listened attentively to the other [ ] 
2. I tried to get the other(s) to do something else [ ] 
3. I let other(s) make plans or decisions [ ] 
4. I did not say how I felt [ ] 
5. I confronted the other(s) about something I did not like [ ] 
6. I expressed affection with words or gestures [ ] 
7. I spoke in a clear firm voice [ ] 
8. I withheld useful information [ ] 
9. I compromised about a decision [ ] 
10. I took the lead in planning/organizing a project or activity [ ] 
11. I avoided taking the lead or being responsible [ ] 
12. I ignored the other(s) comments [ ] 


(SBI questions for Form 2) 


1. I criticized the other(s) [ ] 
2. I smiled and laughed with the other(s) [ ] 
3. I spoke softly [ ] 
4. I made a sarcastic comment [ ] 
5. I expressed an opinion [ ] 
6. I complimented or praised the other person [ ] 
7. I did not express disagreement when I thought it [ ] 
8. I gave incorrect information [ ] 
9. I got immediately to the point [ ] 
10. I made a concession to avoid unpleasantness [ ] 
11. I did not state my own views [ ] 


(SBI questions for Form 3) 


1. I waited for the other person to talk or act first [ ] 
2. I stated strongly that I did not like or that I would not do something [ ] 
3. I assigned someone to a task [ ] 
4. I exchanged pleasantries [ ] 
5. I did not say what was on my mind [ ] 
6. I did not respond to the other(s) questions or comments [ ] 
7. I made a suggestion [ ] 
8. I showed sympathy [ ] 
9. I did not say what I wanted directly [ ] 
10. I discredited what someone said [ ] 
11. I asked the other(s) to do something [ ] 
12. I spoke favorably of someone who was not present [ ] 


(SBI questions for Form 4) 


1. I showed impatience [ ] 
2. I asked for a volunteer [ ] 
3. I went along with the other(s) [ ] 
4. I raised my voice [ ] 
5. I gave information [ ] 
6. I expressed reassurance [ ] 
7. I gave in [ ] 
8. I demanded that the other(s) do what I wanted [ ] 
9. I set goals for the other(s) or for us [ ] 
10. I pointed out to the other(s) where there was agreement [ ] 
11. I spoke only when I was spoken to [ ] 


Note. Scoring of items for dominant, submissive, agreeable, and quarrelsome behaviors can be found 
in Moskowitz (1994). 


APPENDIX 9.2. Sample Verbal Instructions: What Is a Social Interaction? 


First, I'd like to tell you exactly what we mean by a SOCIAL INTERACTION in the context 
of this study. It’s important for us to discuss this, because people sometimes have different ideas 
about what the phrase “social interaction” means. 


By a Social Interaction we mean any situation in which two or more people are involved, and 
are reacting or responding to one another for a minimum duration of five minutes. We set this 
5-minute guideline to ensure that you are not filling out forms all day long about simple greet- 
ings. Instead, you can focus on situations that involve at least 5 minutes of conversation. A good 
example is when you are having a conversation with a friend. But let me try to give you a few other 
examples of what constitutes a social interaction ... 


• If you are sitting beside your friend watching a movie, but not talking to your friend, then 
no interaction is taking place. However, if you are chatting on your way to the movies, or 
discussing the movie afterwards, then these would be social interactions! 

• If you are at work and your boss is giving you instructions or assigning you a task, this would 
be considered a social interaction so long as you are interacting with your boss, asking ques- 
tions, clarifying things, and so forth. If you are simply listening to the instructions and not 
responding, then this would not qualify as a social interaction. 

• A conversation on the telephone would count as a social interaction provided that it lasted at 
least 5 minutes. However, any form of electronic communication (such as e-mail, MSN, or 
text messaging) does not count as a social interaction. (Note. Skype is OK, as a general rule, 
face-to-face, voice-to-voice.) 


So the three key factors in deciding whether or not a social interaction has occurred are: 


• That the interaction was 5 minutes or longer, AND 
• That there was mutual responding; that both of you were responding to each other, AND 
• That the interaction was either in person or over the telephone 


Is this clear? 


It is also important to know when Social Interactions begin or end so that you can determine when 
to fill out a form. We’ve developed a few guidelines to make this easier for you: 


• First of all, a Social Interaction changes when there is a discrete ending. For example, if 
you are discussing something with a friend and then you say goodbye and leave, that Social 
Interaction has obviously ended. However, there are other ways in which a Social Interaction 
may end. 

• If there is a change in the environment, then the interaction changes as well. For example, if 
you were having a conversation with some fellow employees at work, and then you go out to 
lunch with the same people, we would say that this constitutes two separate Social Interac- 
tions: The conversation at work is one social interaction, and the conversation over lunch is 
the second. There could also be a third social interaction that takes place on your way to the 
restaurant, provided that it takes you at least 5 minutes to get there, and that you are chatting 
along the way. 

• A Social Interaction also changes if the composition of the group changes. For example, if 
you are having a conversation with one of your friends and another person joins the conversa- 
tion, we would say that this change in group composition has ended the first Social Interac- 
tion. If you spend more than 5 minutes with these two people, then you could also record 
information about this second Social Interaction. 

• Finally, a Social Interaction changes if the tone or activity shifts. For example, let’s say you 
are having a conversation at home with your romantic partner, then the two of you decide 
to make lunch together. This would involve two Social Interactions: The first would be the 
initial conversation, and the second would be the interaction that occurs while you are mak- 
ing lunch together. 


CHAPTER 10 


Naturalistic Observation Sampling 
The Electronically Activated Recorder (EAR) 


MATTHIAS R. MEHL 
MEGAN L. ROBBINS 


According to many laypersons, psychologists are people-watchers; what they do is 
observe behavior. Ironically, whereas the observation of subjects in their natural 
habitat, or naturalistic observation, is a fairly common method in neighboring disciplines 
(e.g., anthropology, sociology, primatology), it has a remarkably thin history in psychol- 
ogy. In this chapter, we follow Trull’s recommendation of “expanding the aperture of 
psychological assessment” (2007, p. 1), and highlight and review one specific methodol- 
ogy for studying daily life that allows for the unobtrusive sampling of naturalistic obser- 
vations. 

Figure 10.1 shows a simplified method matrix. It organizes types of methods used 
in social science research according to whether data collection is based on self-reports or 
behavioral observation, and whether it takes place in the laboratory or in participants’ 
natural environments (for the sake of simplicity, physiological assessments are excluded). 
The upper-left quadrant shows the generic global/retrospective self-report questionnaire 
(e.g., a standardized behavior checklist). The upper-right quadrant of this matrix shows 
the typical self-report methods for studying daily life, such as self-report-based ambula- 
tory assessment (AA; Fahrenberg, Myrtek, Pawlik, & Perrez, 2007; Wilhelm & Gross- 
man, 2010), ecological momentary assessment (EMA; Stone & Shiffman, 1994), daily 
diary (Bolger, Davis, & Rafaeli, 2003), and experience sampling methods (ESM; Hekt- 
ner, Schmidt, & Csikszentmihalyi, 2007). The lower-left quadrant contains laboratory- 
based observational methods, such as the videotaping of couple (e.g., Heyman, 2001) or 
family interactions (e.g., Margolin et al., 1998). 

As is apparent from Figure 10.1, the lower-right quadrant, behavioral observation in
the natural environment, is not well represented (for exceptions see Bussmann & Ebner-Priemer,
Chapter 13, and Goodwin, Chapter 14, this volume). In fact, in psychology,
extremely few studies have employed person-centered, naturalistic observation (e.g.,
Barker & Wright, 1951; Craik, 2000). Funder (2007) pointed out that, among other reasons,
this is because it is not straightforward how one would go about collecting truly
naturalistic behavioral data. Essentially, it seems, it would require a “detective’s report [that]
would specify in exact detail everything the participant said and did, and with whom, in
all of the contexts of the participant’s life” (p. 41). Because this is ultimately impossible,
self-report-based momentary assessment methods are generally considered the best available
proxy for behavioral observation in the field (Conner, Tennen, Fleeson, & Feldman
Barrett, 2009).

[Figure 10.1: a 2 × 2 matrix crossing the mode of data collection (via self-report vs. via
behavioral observation) with the setting (in the lab vs. in the natural environment).]

FIGURE 10.1. A simplified method matrix. From Mehl (2007). Copyright 2007 by Hogrefe Publishing.
Adapted by permission.

From a multimethod perspective, however, momentary and global/retrospective self- 
reports share important method variance because both derive their data from participants’ 
reports of their introspections and perceptions; in the method matrix, both are located 
within the same row. Therefore, some of the concerns raised for global/retrospective self- 
reports potentially also apply to momentary self-reports (e.g., impression management, 
self-deceptive enhancement, limitations to what participants are aware of; Piasecki, Huf- 
ford, Solhan, & Trull, 2007). Thus, to complete the social science researcher’s toolkit, 
it would be desirable to fill the lower-right quadrant by complementing momentary self- 
report data with momentary observational data. 


A Method for the Naturalistic Observation of Daily Life: 
The Electronically Activated Recorder (EAR) 


Over the last 12 years, we have developed the Electronically Activated Recorder, or EAR 
(Mehl, Pennebaker, Crow, Dabbs, & Price, 2001), a method that unobtrusively samples 
acoustic observations of participants’ momentary environments within the natural flow 
of their lives. 


What Is the EAR? 


The EAR is a portable audio recorder that is set to periodically record brief snippets of
ambient sounds. Participants attach it to their belts or carry it in a purse-like bag while 
going about their daily lives. In tracking moment-to-moment ambient sounds around the 
participants, the EAR yields acoustic logs of their days as they naturally unfold. In sam- 
pling only a fraction of the time, instead of recording continuously, it makes large-scale 
naturalistic observation studies feasible. 

Since its conception in 2001, the EAR has evolved from a modified microcassette 
recorder (Mehl et al., 2001; Mehl & Pennebaker, 2003b) via a microchip-triggered digital 
voice recorder (e.g., Hasler, Mehl, Bootzin, & Vazire, 2008; Holtzman, Vazire, & Mehl, 
2010; Mehl, 2006; Mehl, Gosling, & Pennebaker, 2006; Mehl & Pennebaker, 2003a) to 
today’s third-generation EAR system, which runs on a personal digital assistant (PDA, 
or handheld computer). The PDA-based EAR system has some critical advantages: (1) 
It is software-based and runs on regular, commercial devices (i.e., requires no custom- 
designed hardware); (2) it is available at a reasonable price (the cost of a PDA); and (3) it 
allows for freely programmed recording schedules (e.g., 30 seconds every 12.5 minutes, 
5 minutes every hour), as well as blackout periods with no recordings (e.g., overnight). 
Finally, because now the traditional, self-report-based AA methods and the EAR use the 
same electronic device, it is possible to merge both methodologies. Figure 10.2 illustrates 
how the PDA-based EAR system is worn by a person. 


FIGURE 10.2. Picture illustrating how the PDA-based EAR system is worn by a person.


How Does the EAR Compare to Traditional, Self-Report-Based
AA Methods? 


As a psychological, real-time data capture method, the EAR compares most directly to 
self-report-based AA (or EMA) methods (Bolger et al., 2003; Conner et al., 2009; Stone, 
Shiffman, Atienza, & Nebeling, 2007). Table 10.1 summarizes important similarities 
and differences between the two methods. 

Both methodologies are naturalistic in their approach and based on ecological 
research perspectives (Fahrenberg et al., 2007; Reis & Gosling, 2010; Wilhelm & Gross- 
man, 2010). Whereas for traditional self-report-based AAs both paper-and-pencil and 
PDA-based versions are available (Hektner et al., 2007; see also Kubiak & Krog, Chapter 
7, this volume), the EAR runs only electronically. Also, using the distinction introduced 
by Conner and Lehman (Chapter 5, this volume), self-report-based AA data are provided 
actively through participants’ voluntary actions (e.g., checking a box in response to an 
item), whereas EAR data are collected passively through automatic recordings, without 
participants’ direct involvement (other than wearing the device). 

The most important difference between the two methods is the fact that traditional 
AA or EMA methods are based on momentary self-reports, whereas the EAR is based 
on momentary behavioral observation. Hence, the two types of methods adopt differ- 
ent assessment perspectives: the self, with the corresponding subjective, experiential 
account, versus the bystander, or observer, with the corresponding objective (i.e., “person 
as object”) account. 


TABLE 10.1. A Comparison between Self-Report-Based AA Methods and the EAR Method

                                  Self-report-based ambulatory
                                  assessment methods                      EAR method
Approach                          Naturalistic                            Naturalistic
Medium                            Paper and pencil, electronic (PDA)      Electronic (PDA)
Mode                              Active (data provided through           Passive (data collected
                                  voluntary response)                     through automatic recording)
Method                            Self-report                             Behavioral observation
Perspective                       Self (agent)                            Other (observer)
Awareness of assessment           High                                    Low after habituation
Burden for participant            Practical (interruption of daily        Psychological (intrusion of
                                  life)                                   privacy)
Burden for researcher             Preparing participants (instruction     Preparing the sound data
                                  and training)                           (coding and transcribing)
Data collection limited by ...    Response burden                         Privacy considerations, lab
                                                                          capacity for data coding
Optimized for assessment of ...   Subjective experiences and              Objective social environments
                                  perceptions                             and interactions

Note. From Mehl (2007). Copyright 2007 by Hogrefe Publishing. Adapted by permission.

Self-report-based AA methods by nature require participants’ awareness of the
assessment. In contrast, the EAR operates imperceptibly; participants never know when 
the recorder is on or off. Furthermore, after an initial period of device-induced self- 
awareness (approximately 2 hours), participants generally habituate to wearing the EAR 
and often report forgetting about it for extended periods of time (Mehl & Holleran, 
2007). 

The two methods further differ in the burden they place on participants. Self-report- 
based AA methods come with the practical burden of requiring participants intermit- 
tently to interrupt the flow of their daily lives to answer a series of questions. This practi- 
cal burden creates an upper limit for the number of prompts that can be implemented per 
day, and the number of questions that can be asked per prompt. The practical burden of 
the EAR consists of wearing the device; this burden is relatively low and independent of 
the sampling rate or the amount of information that is extracted (i.e., coded) from the 
sound files. However, the EAR places a very different burden on participants: the psy- 
chological discomfort of knowing one is intermittently recorded (sometimes referred to 
as evaluation apprehension). Therefore, EAR data collection is limited to sampling rates 
that result in privacy intrusions that are tolerable for participants. 

Finally, the two methods also differ considerably in the kinds of burden they place 
on the researcher. With self-report-based AA methods, the researcher’s challenge consists 
of adequately instructing and training participants to ensure high compliance and data 
quality. With the EAR, very little participant preparation is necessary (other than creat- 
ing good rapport). Participants receive the device activated and, ideally, wear it without 
ever touching a button. However, with the EAR, a major challenge for the researcher lies 
in preparing the large amount of sound data. Somehow, the rich information contained 
in the sound files needs to be quantified. This usually means a sizable team of research 
assistants coding and transcribing sound data for hundreds of hours. Therefore, EAR 
data collection is often also practically limited by an investigator’s laboratory capacity for 
coding the large amount of sound data. 

Taken together, these practical and conceptual differences between traditional self- 
report-based AA methods and the EAR suggest that the two methodologies are best 
suited for slightly different assessments. In capturing the agent’s “insider” perspective, 
self-report-based AA methods are optimized for the assessment of participants’ subjective 
experiences and perceptions (e.g., thoughts, feelings, attributions). In contrast, in cap- 
turing the observer’s “outsider” perspective, the EAR is optimized for assessing audible 
aspects of participants’ objective social environments and interactions (e.g., social set- 
tings, communication behaviors, language use). 


What Information Can Be Extracted from the EAR Recordings? 


To extract relevant information from the sampled ambient sounds, researchers can either 
adopt a psychological rating or a behavior coding approach (Sillars, 1991). With the 
psychological rating approach, expert raters listen to the full set or selected segments 
of participants’ sound files and judge the degree to which they indicate the presence of 
a construct of interest. For example, relationship experts could rate captured conversa- 
tions with the participants’ significant other on relationship satisfaction, social support, 
expressed emotions, or protective buffering (Kerig & Baucom, 2004). Or communication 
experts could rate captured workplace conversations for how competent the participant
appears in them (Holleran, Whitehead, Schmader, & Mehl, 2011). In these cases infor- 
mation is extracted at a molar, psychological level. Reliability can be determined from the 
consensus among the expert raters, and the construct validity of the ratings emerges from 
comparisons with established criterion measures (e.g., self- or spousal reports; coworker 
or supervisor ratings). 

In our research, we have primarily worked with behavior codings. With this approach, 
information is extracted at the molecular level of the raw behavior. Trained coders listen 
to all of a participant’s EAR recordings and code each sound file using a standardized 
coding system. Over the years, we have developed and refined the Social Environment 
Coding of Sound Inventory (SECSI; Mehl et al., 2006; Mehl & Pennebaker, 2003b) to 
capture acoustically detectible aspects of participants’ social environments and interac- 
tions. In its basic form, the SECSI comprises four category clusters: (1) the person’s cur- 
rent location (e.g., at home, outdoors, in transit; all inferred from ambient, audible cues 
to the location, such as the wind blowing outside or the sound of surrounding traffic 
while inside a car); (2) activity (e.g., listening to music, watching TV, eating); (3) inter- 
action (e.g., alone, talking, on the phone); and (4) emotional expression (e.g., laughing, 
crying, sighing). 

Conceptually, the SECSI captures information about how individuals (1) select themselves
into social environments (e.g., displaying a preference for spending time in one-on-one
vs. group settings) and (2) interact with their social environments (e.g., laughing or sighing
a lot; see Figure 10.3). Adding to the basic SECSI system, we have developed
more specific coding systems that aim at capturing, for example, the topics of students’
daily conversations (e.g., school, politics, entertainment, sex), coping-relevant aspects of
patients’ interactions with their support networks (e.g., disclosure, positive or negative
support received), and behavioral residue of meditation training in daily life (e.g., gratitude,
affection, empathy). Finally, in our laboratory, coders also routinely transcribe all of
the participants’ utterances captured by the EAR. We then content analyze the verbatim
transcripts, usually using the Linguistic Inquiry and Word Count software (Pennebaker,
Booth, & Francis, 2007), to obtain information about participants’ linguistic styles.

[Figure 10.3: diagram headed “Naturalistic Person-Environment Interactions,” with
“Environment” and “Interaction” branches and six groups of example codes: (1) e.g., at home,
at work, at the gym, outside, at a restaurant/coffee shop, at a friend’s house; (2) e.g.,
working, going out, watching TV, listening to music, eating, housework, religious
participation; (3) e.g., alone, one-on-one, in a group, on the phone, with a stranger, with
friend/family member, with partner; (4) e.g., health, food, fashion, relationships, sex,
politics, job, sports, entertainment, money; (5) e.g., emotion words, cognitive words, past
vs. present tense, fillers (“like”), swearing, pronouns (“I,” “we”); (6) e.g., laughing,
sighing, crying, arguing.]

FIGURE 10.3. EAR assessment of participants’ daily social environments and interactions.

To assess the psychometric properties of the EAR data, we obtain estimates of inter- 
coder reliability by having all coders of a study code a standard set of training EAR 
recordings. Consistent with the specific, concrete, and behavioral nature of the codings 
(e.g., “talking” or “laughing”), intercoder reliabilities tend to be high. The majority of 
the SECSI categories have reliabilities that exceed r = .80 (Mehl et al., 2006; Mehl & 
Pennebaker, 2003b; Vazire & Mehl, 2008). It is an advantage of the coding approach 
that behavior codings at the molecular level (e.g., “talking”) are less susceptible to inter- 
pretational ambiguity than psychological ratings at the molar level (e.g., “relationship 
satisfaction,” “competence”). Yet, to the extent possible (given the large amount of data), 
it is recommended that at least two independent research assistants code the data to 
increase reliability. 
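
To make the reliability estimate concrete, the following minimal Python sketch correlates two coders’ codes for a single category across a shared set of sound files. The codes are invented for illustration, `pearson_r` is a hypothetical helper, and the use of a plain Pearson correlation over file-level codes is an assumption; the studies cited above may have computed reliability differently (e.g., across more than two coders).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of codes."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a coder's codes never vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical codes for "talking" on ten training sound files
# (1 = coded as present, 0 = coded as absent); the coders disagree on one file.
coder_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
coder_b = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

if __name__ == "__main__":
    print(f"intercoder r for 'talking': {pearson_r(coder_a, coder_b):.2f}")
```

With these invented codes, a single disagreement in ten files already lowers the correlation to about .82, which illustrates why a reasonably large standard set of training recordings is useful.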


What Are Ethical Considerations around the EAR Method? 


Recording ambient sounds around participants raises ethical and legal questions. EAR 
studies conducted in our laboratory routinely implement a series of safeguards to protect 
participants’ privacy and to ensure confidentiality of the data (Mehl, 2007). We have 
found these safeguards to be highly effective at alleviating concerns participants may 
have about the method. First, the EAR is programmed to record only a fraction of a per- 
son’s day. Our original pattern, 30 seconds every 12.5 minutes, recorded less than 5% of 
the time and left more than 95% of participants’ days private in the first place, but still 
yielded almost an hour of audio data per day. Now we usually sample 50 seconds every 
9 minutes, which still leaves 90% of the time unrecorded. Second, the recordings are 
kept short; 30-second or 50-second recordings are long enough to extract basic behav- 
ioral information reliably, yet they are short enough to capture little contextualized per- 
sonal information. Finally, and most importantly, all participants can listen to their EAR 
recordings and delete parts they do not want on record, before the investigators access 
the data. In one study (Mehl et al., 2006), 19 out of 96 participants (19.8%) reviewed 
their recordings, but only three erased sound files (10 in total). In another study with 13 
arthritis patients, only one out of 2,948 waking sound files was erased (Robbins, Mehl, 
Holleran, & Kasle, 2011). This suggests that participants feel generally quite comfortable 
sharing the sounds of their daily lives under the safeguards we routinely implement. 
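
The proportions mentioned above follow directly from the recording schedule. As a minimal sketch, the Python snippet below computes the recorded fraction and the resulting amount of audio. The function names are hypothetical, and the snippet assumes one recording per cycle and 24 hours of monitoring with no blackout period.

```python
def recorded_fraction(record_seconds: float, cycle_minutes: float) -> float:
    """Fraction of monitored time that is recorded.

    Assumes one recording of `record_seconds` per cycle of `cycle_minutes`
    (e.g., 30 seconds every 12.5 minutes).
    """
    cycle_seconds = cycle_minutes * 60
    if not 0 < record_seconds <= cycle_seconds:
        raise ValueError("recording must be positive and fit within one cycle")
    return record_seconds / cycle_seconds


def audio_minutes(record_seconds: float, cycle_minutes: float,
                  monitored_hours: float = 24.0) -> float:
    """Minutes of audio collected over `monitored_hours` of monitoring."""
    return recorded_fraction(record_seconds, cycle_minutes) * monitored_hours * 60


if __name__ == "__main__":
    for label, secs, cycle in [("30 s every 12.5 min", 30, 12.5),
                               ("50 s every 9 min", 50, 9)]:
        frac = recorded_fraction(secs, cycle)
        print(f"{label}: {frac:.1%} recorded, {1 - frac:.1%} unrecorded, "
              f"{audio_minutes(secs, cycle):.0f} min of audio per 24 h")
```

Under these assumptions, the original schedule records 4% of the time (about 58 minutes of audio over 24 hours), and the 50-seconds-every-9-minutes schedule records a little over 9%.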

However, the more serious concerns revolve around not the participants themselves 
but bystanders, who are not directly involved in the study but whose behaviors are cap- 
tured by the EAR. In the United States, there are very few restrictions about recording 
people’s utterances in public places. The situation concerning the recording of private 
conversations is more ambiguous. In most parts of the United States (including Texas and 
Arizona, where the studies from our laboratory have been conducted), recordings can be 
made legally if at least one of the interactants (e.g., the participant who is wearing the 
EAR) has knowledge of the recording device. A small number of states allow recordings 
only if all interactants have knowledge of the recording device. Even in the most legally 
restrictive states, however, unauthorized recordings are problematic only if they are per- 
sonally identifiable. 

In EAR studies from our laboratory, participants are encouraged to wear the microphone
visibly and to openly mention the EAR in conversations with others. Irrespective
of such notification, anonymity of other people’s utterances is of paramount importance, 
because their behavior is collected without explicit informed consent. As mentioned ear- 
lier, the brief recording snippets minimize the chance that personally identifying informa- 
tion about a third person is captured. For further protection, the sound files are coded by 
a trained staff that is certified for research with human subjects. In the coding process, 
then, any personally identifying information is omitted from the transcripts. Finally, par- 
ticipants always have the option of erasing sound files before the researchers can access 
them. It is thus highly unlikely that the EAR paradigm, as we have established it, violates 
privacy rights of people who are inadvertently recorded. 


How Obtrusive Is the Method and How Well Do Participants Comply 
with Wearing the EAR Device? 


The EAR method requires participants to tolerate being recorded intermittently, without 
knowing exactly when. This can create evaluation apprehension and result in reactance 
(i.e., censored or artificial behavior) or noncompliance (i.e., not wearing the EAR). Thus, 
it is critical to estimate how obtrusive the method is in daily life and how well partici- 
pants comply with it. 

Mehl and Holleran (2007) addressed these questions by analyzing measures of both 
self-reported and behaviorally assessed EAR obtrusiveness and compliance in two sam- 
ples: short-term (2 days; Mehl et al., 2006) and longer-term (10-11 days; Mehl & Pen- 
nebaker, 2003a) monitoring. Self-reported obtrusiveness was measured with items such 
as “To what degree were you generally aware of the EAR?” or “To what degree did the 
EAR impede on your daily activities?” As a behavioral measure of obtrusiveness, the 
coders counted the number of sound files in which participants mentioned the EAR with 
others. As a self-report measure of compliance, participants reported what percentage 
of the day they were wearing the EAR. Finally, as a behavioral compliance measure, the 
coders counted the number of sound files that indicated that participants were not wear- 
ing the EAR. “Not wearing the EAR” was coded if over a 30-second recording period no 
ambient sounds at all were recorded—not even sounds of breathing or clothes rubbing 
against the microphone. 

The analyses painted the following picture about the method’s obtrusiveness. 
Shortly after receiving the EAR, participants go through a brief period of heightened self-awareness
in which conversations about the EAR are frequent. Within 2 hours of wearing
the device, however, most participants habituate to the method and rarely mention it with 
others anymore (Panel A of Figure 10.4). This habituation effect was found for not only 
the short-term monitoring but also longer-term monitoring. In the longitudinal sample, 
some individuals initially talked about the method more than others; yet after 5-6 days
of wearing the device, virtually all participants had adjusted to it and barely mentioned it 
anymore in their daily conversations. 

The study further yielded the following findings about participants’ compliance. In
the short-term monitoring, participants’ compliance was very high in the first hours after
they had received the EAR. Noncompliance gradually increased over time and leveled
off at about 10-12% on the second day of monitoring (Panel B of Figure 10.4). Compliance
in the longer-term monitoring was high for at least 6 days. After that, variability
in noncompliance increased, suggesting that some participants’ tolerance threshold may
have been reached.

FIGURE 10.4. Behaviorally assessed obtrusiveness of and compliance with the EAR method.
From Mehl and Holleran (2007). Copyright 2007 by Hogrefe Publishing. Reprinted by permission.

The compliance data reported in Mehl and Holleran (2007) are based on two stud- 
ies with undergraduate student samples. We have since run a series of EAR studies with 
samples of older, working adults (e.g., individuals with rheumatoid arthritis, couples in 
which one member was receiving treatment for breast cancer, faculty members at a Tier I 
research institution) and have obtained highly comparable results regarding EAR obtru- 
siveness and compliance. 

Taken together, this suggests (1) that EAR compliance and obtrusiveness can be 
reliably assessed, (2) that compliance is generally high and comparable to what has been 
reported for self-report-based EMAs (Green, Rafaeli, Bolger, Shrout, & Reis, 2006), and
(3) that after an initial habituation period of about 2-3 hours, the method operates fairly 
unobtrusively and does not impede participants much in their normal activities. 


What Things Are Captured on the Sound Files? 
To What Extent Does the EAR Reveal Real Life? 


As the metaphorical “researcher’s ear” on the participant’s lapel, the EAR essentially 
eavesdrops on people’s daily lives. That naturally raises the question of what things are
captured on the sound files and to what extent the EAR reveals real life. One of our first 
“aha!” experiences when we started doing EAR research was to realize how ordinary and 
mundane real life really is. The sound files we obtained from participants, first and fore- 
most, documented that for most people, most of real life is not really thrilling, glittery, 
and extraordinary. In the end, our daily lives tend to be fairly “average.” In the words 
of one of our participants (after listening to her own sound files), “I, as probably most 
people, like to think of myself as interesting and superior. Listening to myself, however, 
I have concluded that I am most certainly not. I am just like everybody else.” Much of 
what the EAR captures is either silence (apart from ambient noises) or rather banal and 
linguistically unrefined utterances that reveal participants engaged in the pursuit of their 
daily activities (e.g., school, commuting, watching TV). In essence, the majority of the 
sound files speak to the ordinary, humdrum nature of daily life (Craik, 2000). 

Yet, with its fine-grained grid of observations, the EAR also regularly captures some 
of the less publicly presentable aspects of a person’s social behavior. For example, the EAR 
at times catches intimate conversations, as well as emotional outbursts, arguments, and 
profanity. In addition to documenting a person’s behavior “on stage,” it also reveals some 
of those moments when humans are caught off guard and show their usually hidden, weak, 
and unpolished faces (Goffman, 1959). This potential of the EAR to capture “off-stage” 
behavior can, for example, be used by researchers to get a better handle on the assessment
of theoretically important but notoriously difficult-to-measure,
evaluatively loaded behaviors, such as small talk (Mehl, Vazire, Holleran, & Clark, 2010), 
swearing (Robbins et al., in press), or negative social support (Mehl, 2007). 


How Frequently and for How Long Should the EAR Sample? 


The first EAR studies all sampled at a rate of 30 seconds every 12.5 minutes. This some- 
what strange sampling pattern was initially chosen (1) to obtain about five data points per 
hour, (2) to avoid the oversampling of periodic behaviors (e.g., the news at the full hour), 
and (3) indeed, to fit 1 day of monitoring onto one side of a 90-minute microcassette! 
Ironically, what started largely as a product of pragmatics and good guesses turned out to 
be a rather effective solution. This sampling pattern not only has resulted in a number of 
unique findings (e.g., Mehl, Vazire, Ramirez-Esparza, Slatcher, & Pennebaker, 2007) but 
also has turned out to be psychometrically sound. Simulating different sampling patterns 
from continuous audio recordings of 11 children, Fellows, Hixon, Slatcher, and Penne- 
baker (2010) recently confirmed empirically—using a given sampling rate—that shorter 
(30 seconds) recording segments are superior to longer ones (90 seconds) for adequately 
representing the participants’ full-day behavior. Furthermore, and not surprisingly, higher 
sampling rates outperformed lower ones (e.g., 12.5 vs. 2.5% of the time). 
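
The logic of simulating a sampling pattern from a continuous record, and the reason for avoiding a cycle that lines up with the clock, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions and not the procedure used by Fellows and colleagues: the timeline is synthetic, the function names are hypothetical, and the estimate is simply the proportion of sampled seconds in which a sound is present.

```python
def sample_windows(total_seconds, record_seconds, cycle_seconds, offset=0):
    """Yield (start, end) second indices of each complete recording window."""
    if record_seconds <= 0 or cycle_seconds < record_seconds:
        raise ValueError("need 0 < record_seconds <= cycle_seconds")
    start = offset
    while start + record_seconds <= total_seconds:
        yield start, start + record_seconds
        start += cycle_seconds


def sampled_proportion(timeline, record_seconds, cycle_seconds, offset=0):
    """Proportion of sampled seconds in which the behavior is present.

    `timeline` holds one boolean per second of continuous observation.
    """
    hits = total = 0
    for start, end in sample_windows(len(timeline), record_seconds,
                                     cycle_seconds, offset):
        hits += sum(timeline[start:end])
        total += end - start
    if total == 0:
        raise ValueError("no complete recording window fits in the timeline")
    return hits / total


# Hypothetical 16-hour day in which a sound (say, a news broadcast) is
# present during the first 5 minutes of every hour.
DAY_SECONDS = 16 * 3600
news = [(t % 3600) < 300 for t in range(DAY_SECONDS)]

if __name__ == "__main__":
    print(f"continuous record:   {sum(news) / len(news):.1%}")
    print(f"30 s every 15 min:   {sampled_proportion(news, 30, 900):.1%}")
    print(f"30 s every 12.5 min: {sampled_proportion(news, 30, 750):.1%}")
```

In this toy timeline, a 15-minute cycle that starts on the hour places one of every four recordings inside the broadcast and therefore overestimates it (25% against a true 8.3%), whereas the 12.5-minute cycle drifts relative to the hour and comes much closer (about 9%).
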

In the more recent EAR studies, we have switched to sampling for 50 seconds every
9 or 18 minutes, resulting in approximately 10 or 5% of the day being recorded, respectively. We
implemented this modification (1) to address reviewers’ concerns that more context
is needed to obtain valid codings and (2) to capture additional context
for coding constructs at a more molar, psychological level (e.g., disclosure). Because our
recording segments are still relatively short, and because we sample at a relatively high 
rate, our current pattern is well supported by Fellows and colleagues’ (2010) findings.
Ideally, though, future research would further explore optimal sampling rates for behav- 
iors of varying frequency and duration. 

A final question concerns the duration of the EAR monitoring. In our first EAR 
study (Mehl & Pennebaker, 2003b), participants wore the EAR twice for 2 full
days, separated by approximately 4 weeks. The analyses revealed high stability coef- 
ficients for most behaviors, suggesting that 2-day monitorings are sufficient to capture 
habitual aspects of people’s daily social environments and interactions. To be sure, longer 
monitoring periods are always preferable (and we have employed them in some studies). 
Yet given the labor intensity of the coding, longer monitorings in reality often compete 
against larger samples, with, based on our experience, the sample-size advantage being 
stronger than the number-of-occasions advantage (see also Bolger, Stadler, & Laurenceau, 
Chapter 16, this volume). 


How Can Researchers Get the EAR? 
What Practical Things Are Important to Know to Use It? 


Researchers can obtain a copy of the software directly from our laboratory free of cost 
(and only need to purchase a PDA device, a Secure Digital [SD] card, and a protective 
case to get started). All we ask in return is that users share their experiences and provide 
feedback on how the system can be improved. Our hope is that this change will lower 
the psychological and economic hurdles for researchers interested in the EAR, and foster 
more widespread use of the method. 

It is beyond the scope of this chapter to provide all the necessary practical informa- 
tion for running an EAR study (due to technical progress and commercial device life 
cycles, this information is also subject to frequent change). However, we have main- 
tained a researcher’s guide with this purpose in mind. This guide is available from our 
laboratory and contains (1) hardware recommendations (e.g., the software runs only on 
Windows Mobile-operated handheld computers or cell phones), (2) instructions for how 
to install and use the software, (3) a sample consent form, (4) a set of standard question- 
naires (e.g., EAR compliance and obtrusiveness questionnaire), and (5) a script for how 
to administer the EAR. Apart from providing this guide, we are always happy to help 
“jump-start” and troubleshoot. 


What Is the Added Value 
of the EAR Method for Studying Daily Life? 


Conceptually, the EAR now provides a tool in the social science researcher’s toolbox for 
person-centered behavioral observations in the natural environment. Ultimately, though, 
methodologies justify their existence not from filling quadrants in a method matrix, but 
from bringing unique potentials to psychological assessment. What, then, is the added
value of the EAR method? In this section, we discuss three ways in which the EAR, as 
a naturalistic observation method, can uniquely inform the psychological study of daily 
life. 


It Can Help Calibrate Psychological Effects against Frequencies 
of Real-World Behavior 


A persistent problem that the EAR can help resolve is the calibration and interpretation 
of psychological effects. In a seminal article, Sechrest, McKnight, and McKnight (1996) 
pointed out that “very few psychological measures of any kind are expressed in a metric 
that is intuitively or immediately meaningful” (p. 1065), and that the discipline would 
benefit from developing “a better understanding of the measures by which the phenom- 
ena with which we concern ourselves are gauged” (p. 1068). Indeed, the vast majority 
of standardized measures use arbitrary metrics, such as 5- or 7-point rating scales. Such 
measures have no clear referents that inform about what scoring at a certain level (e.g., a 
“5” on optimism) means for how a person fares in important domains of life. For exam- 
ple, how much less time does a person with a “4” on an extraversion scale spend alone 
compared to a person with a “2”? And, how does a person’s daily life change if an inter- 
vention reduces his depression score by 7 points? Finding answers to questions like these 
is crucial for understanding the implications of psychological effects. Yet the field has 
struggled greatly with accomplishing this (Blanton & Jaccard, 2006; Kazdin, 2006). 

One advantage is that the EAR’s sound file-based behavioral codings can be readily 
converted into a metric that is nonarbitrary, intuitively meaningful, and inherently real 
world-relevant. If the EAR captures a person talking in 40 out of 120 recordings, one 
can estimate that the person spent about one-third of the time awake (or about 5 hours) 
talking. Or if TV sounds are present in 20% of the recordings, one can estimate that 
the person was (actively or passively) watching TV 20% of the waking day (or a little
over 3 hours). By linking EAR-derived frequencies of daily behavior to the metrics of
measures, a better understanding of the real-world implications of psychological effects can
be obtained. 
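
The arithmetic behind these estimates can be sketched in a few lines of Python. The function name and the 16-hour waking day are illustrative assumptions (the text implies a waking day of roughly this length but does not state one); nothing here is part of the EAR software itself.

```python
def estimated_hours(n_with_behavior, n_recordings, waking_hours=16.0):
    """Estimate daily hours spent on a behavior from sampled recordings.

    Assumes the recordings are a representative sample of the waking day.
    waking_hours=16 is an illustrative assumption, not a fixed constant.
    """
    if n_recordings <= 0:
        raise ValueError("n_recordings must be positive")
    proportion = n_with_behavior / n_recordings
    return proportion * waking_hours


talking = estimated_hours(40, 120)  # talking coded in 40 of 120 recordings
tv = estimated_hours(24, 120)       # TV sounds in 20% of 120 recordings
```

Under this assumption the two examples come out at a little over 5 hours of talking and a little over 3 hours of TV exposure; a different assumed waking day shifts both estimates proportionally.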

For example, in one EAR study (Mehl et al., 2006), Extraversion was correlated 
(r = -.27) with time spent alone, and Conscientiousness (r = .42) with time spent in class. 
Converted into a more meaningful metric, this suggests that participants who marked a 
“4” on the 5-point Extraversion scale spent almost 10 percentage points less time alone than those who
marked a “2” (61.4 vs. 70.8%). And participants who marked a “4” on the 5-point Conscientiousness
scale spent about three times as much time in class as those who marked a
“2” (11.9 vs. 4.1%). Similarly, testing the myth that women are by a factor more verbose 
than men, Mehl and colleagues (2007) revealed, based on six EAR studies, that both men 
and women use about 16,000 words per day. Compared to a range of over 46,000 words 
between the least and most talkative individual (695 vs. 47,016), a sex difference of 546 
words rendered significance testing close to meaningless—and spoke impressively to the 
magnitude of individual differences. Finally, Mehl et al. (2010) recently found that well- 
being is related to having less small talk and more substantive conversations. The magni- 
tudes of these effects were vividly illustrated by the fact that, compared to the unhappiest 
participants (-2.0 SD), the happiest ones (+1.5 SD) had roughly one-third as much small 
talk (10.2 vs. 28.3%) and twice as many substantive conversations (45.9 vs. 21.8%). 
Thus, in facilitating an absolute metric in the measurement of daily social behaviors
and environments, the EAR can help “benchmark” and interpret psychological effects. 
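
How a correlation translates into an absolute difference of this kind is not spelled out in the text. One plausible route, sketched here under stated assumptions, is the unstandardized regression slope b = r × SD(outcome) / SD(predictor), multiplied by the gap in scale points. The standard deviations below are hypothetical placeholders chosen only for illustration; they are not values reported by Mehl et al. (2006), and only r = -.27 comes from the text.

```python
def raw_difference(r, sd_predictor, sd_outcome, predictor_gap):
    """Expected outcome difference for a given gap on the predictor.

    Uses the simple-regression slope b = r * SD_outcome / SD_predictor.
    """
    if sd_predictor <= 0:
        raise ValueError("sd_predictor must be positive")
    slope = r * sd_outcome / sd_predictor
    return slope * predictor_gap


# Hypothetical SDs: 0.8 scale points for Extraversion and 14.0 percentage
# points for time spent alone. A gap of 2 compares a "4" with a "2".
diff = raw_difference(r=-0.27, sd_predictor=0.8, sd_outcome=14.0, predictor_gap=2)
```

With these placeholder values the expected difference is a little over 9 percentage points less time alone, which shows how a modest correlation can correspond to a sizable difference in daily behavior.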


It Can Provide Ecological, Behavioral Criteria That Are Independent 
of Self Report 


The “criterion problem” is a vexing issue in the field (Kruglanski, 1989). How can we 
study how much self-insight people have into “how they really are” if the only way to 
assess “how they really are” is to ask them how they are? Or, more broadly, how can we 
study processes underlying self- and social perceptions if we do not have a way to line 
up these perceptions with independent measures of “actual reality”? In cases where it is 
necessary to measure behavioral criteria independent of self-report, the EAR can help 
accomplish this. 

For example, Vazire and Mehl (2008) tested the accuracy of self- and other-reports 
by comparing the predictive validity of participants’ self-ratings of how much they engage 
in different daily behaviors (e.g., talking on the phone, laughing, watching TV, listening 
to music) to similar ratings obtained from people who knew the participants well. The 
frequency with which the EAR captured participants actually engaging in these behav- 
iors (e.g., actual time spent on the phone over a period of 4 days) served as an “impar- 
tial” accuracy criterion. Self- and other-ratings not only showed identical validity but 
also uniquely predicted certain behaviors. For example, whereas the self was better at 
estimating the amount of time spent arguing, friends had a more accurate picture of how 
sociable participants were, that is, how much time they spent in the company of others. 
Importantly, to avoid giving one perspective an undue predictive advantage, it was criti- 
cal to minimize shared method variance with both. The EAR-derived behavior counts 
maximally accomplished this, while preserving the study’s ecological focus. 

Similarly, responding to Terracciano and colleagues’ (2005) influential finding that 
national stereotypes have zero validity, Heine, Buchtel, and Norenzayan (2008) argued 
that “comparing means on subjective Likert self-report scales is the most commonly used 
method for investigating cross-cultural differences, yet there are many methodological 
challenges associated with this approach” (p. 309). Following their advice to concen- 
trate on behavioral trait markers, Ramirez-Esparza, Mehl, Alvarez Bermúdez, and Pen- 
nebaker (2009) compared Americans’ and Mexicans’ sociability in a binational EAR 
study. They found that although American participants reported being more sociable 
than their Mexican counterparts, they spent less time with others and had fewer social 
(i.e., noninstrumental) conversations. Intriguingly, whereas Americans rated themselves 
significantly higher than Mexicans on the item “I see myself as a person who is talkative,” 
they in fact spent about 9 percentage points less time talking (34.3 vs. 43.2%). Thus, a behavior-counting
approach, such as the one employed by the EAR, can help with circumventing methodological
problems around the use of self-report in cross-cultural research.

Finally, in collaboration with Dr. Raison from the Mind-Body Program at Emory 
University, we are currently running a randomized controlled trial using the EAR method 
to test how meditation training changes its practitioners’ social behaviors and environ- 
ments. There is consensus among meditation researchers that a clear demonstration of the 
real-world, prosocial effects of meditation is crucial for the field. Importantly, however, 
retrospective and even momentary self-reports cannot unambiguously distinguish change 
at the level of the self-concept from change at the objective, behavioral level. If participants
report having emerged as kinder, calmer, and more compassionate after meditation
training, is it because meditation aims at and works through transforming (self-)perceptions,
and therefore the self-report, or because it leads practitioners, in fact, to engage
in more kind, calm, and compassionate acts in their daily lives? Coding the EAR sound 
files for acts of gratitude, altruism, and compassion can circumvent this methodological 
issue and help provide the direct behavioral evidence the field has been seeking. 

Taken together, in providing ecological, behavioral criteria that are independent of 
self-reports, the EAR can contribute in unique ways to resolving important questions in
the field.


It Can Help with the Assessment of Subtle and Habitual Social Behaviors 
That Evade Self-Report 


Asking participants to report accurately on subtle or habitual social behaviors is a task 
that often goes beyond what self-report can accomplish. For example, Schwarz (2007) 
demonstrated that frequent mundane behaviors, like sighing or laughing, tend to be par- 
ticularly difficult for participants to report retrospectively because occurrences become 
indistinguishable and irretrievable. Though self-report measures can inform about par- 
ticipants’ self-perceptions, they often do not yield good representations of the true preva- 
lence of subtle and habitual behavior. Multiple studies have illustrated just how precari- 
ous self-reported estimates of behavior frequencies can be (Schwarz, 2007). Momentary 
or end-of-day event diaries can evade memory problems associated with retrospective 
self-reports, but—as Piasecki and colleagues (2007) have pointed out—even in event dia- 
ries, participants can report only what they noticed and remembered when they had to 
complete the diary. In the stream of our daily lives, subtle and habitual behaviors often 
simply do not pass the threshold of consciousness. Therefore, the study of such subtle and 
habitual aspects of our daily lives necessitates a behavioral observation approach. 

One study that exemplifies how the EAR can capture subtle social behaviors tested 
the degree to which spontaneous sighing is a behavioral indicator of depression among 
rheumatoid arthritis (RA) patients (Robbins, Mehl, Holleran, & Kasle, 2011). Thirteen 
RA patients wore the EAR (recording 50 seconds every 18 minutes) for two weekends sep- 
arated by 1 month. Depression and physical symptoms were assessed via questionnaires. 
As “an obvious exaggerated exhalation of breath” (Keefe & Block, 1982, p. 366),
incidents of sighing were readily captured by the EAR and could be reliably coded from 
the sampled ambient sounds. Interestingly, sighing was significantly and strongly related 
to patients’ levels of depression and nonsignificantly and less strongly to their reported 
pain and number of flare days. Because of the small sample size, the findings are prelimi- 
nary in nature, yet they suggest that sighing can be an observable marker of depression 
and may be more of a depression behavior than a pain behavior among patients with RA. 

Other behaviors are less subtle but highly automatic and thus difficult to report. 
For example, swearing’s habitual and nonfocal nature in everyday conversations makes 
it difficult to self-report (Jay, 2009). In a recent study of the intra- and interpersonal 
consequences of swearing, we combined data from two pilot studies of 13 women with 
RA and 21 women with breast cancer (Robbins et al., 2011). Participants wore the EAR 
on weekends to track their daily conversations. All sound files were transcribed and 
submitted to Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2007) to 
determine their degree of swearing. In addition, participants completed self-reported 
measures of depression and emotional support at the start of the EAR weekend and several
months later at the follow-up. Consistent with the idea that swearing can repel support
at the downstream expense of psychological adjustment, swearing in the presence
of others, but not alone, was related to decreases in emotional support and increases in 
depressive symptoms over the study period. Furthermore, decreases in emotional support 
mediated the effect of swearing on disease-severity-adjusted changes in depressive symp- 
toms. Again, these effects are preliminary in nature and may well be limited to women in 
midlife for whom swearing violates gender and age norms. Yet, together with the sighing 
findings, they highlight the importance of investigating behaviors that play an important 
role in daily life but are often too subtle or habitual for participants to report retrospec- 
tively or in the moment. 
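
Dictionary-based word counting of the kind used to quantify swearing in transcripts can be illustrated with a toy example. The sketch below is not LIWC and does not use its dictionaries; the word list, function name, and sample transcript are hypothetical, and real dictionaries also handle word stems and many categories.

```python
import re

# Hypothetical, deliberately tiny word list; not the LIWC swear-word category.
SWEAR_WORDS = {"damn", "hell", "crap"}


def swear_rate(transcript, lexicon=SWEAR_WORDS):
    """Return the percentage of words in a transcript found in the lexicon."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in lexicon)
    return 100.0 * hits / len(words)


rate = swear_rate("Damn, I left the keys at home again.")
```

Expressing the result as a percentage of all words keeps talkative and quiet participants comparable, because the score does not grow simply with the amount of speech captured.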


Summary and Conclusion 


Our purpose in this chapter has been to provide a review and discussion of a still relatively 
young naturalistic observational sampling method: the Electronically Activated Recorder, 
or EAR. As the metaphorical researcher’s ear on the participant’s lapel, it eavesdrops on 
people’s daily lives and provides highly naturalistic, experientially vivid, and psychologi- 
cally rich information about their moment-to-moment (acoustic) social worlds. Within 
the research methods for studying daily life, the EAR clearly occupies a methodological 
niche; it is not for everyone and everything. It is highly labor-intensive and thus requires 
careful deliberation as to when it should be used instead of more economic methods (e.g., 
experience sampling, daily diaries). However, in providing ecological behavioral mea- 
sures that are independent of self-report and often beyond what self-report can capture, 
it can yield valuable findings that are difficult to obtain otherwise and support the field 
in the mission gradually to “put a bit more behavior back into the science of behavior” 
(Baumeister, Vohs, & Funder, 2007, p. 401). 
CHAPTER 11


Ambulatory 
Psychoneuroendocrinology 


Assessing Salivary Cortisol 
and Other Hormones in Daily Life 


WOLFF SCHLOTZ 


Psychoneuroendocrinology (PNE) is an interdisciplinary research area focusing on
interactions among brain, behavior, and hormones. Such interactions take place in 
two directions. First, because the secretion of many hormones is the end-product of sig- 
nal cascades that originate in the brain, psychological factors can directly influence endo- 
crine systems via neuronal processes. Second, hormones can influence neuronal activity, 
and hence psychological processes, by binding at receptors within the brain. Due to these 
interactions, PNE research is of fundamental importance for understanding brain-body
interactions. Because hormones interact with all organ systems, understanding psycho- 
logical factors that influence endocrine activity is also highly relevant for understanding 
health and disease. 

Traditionally, PNE research relied on laboratory studies, which allow a high level of 
control over exposures and assessment of hormone concentrations from blood by medi- 
cally trained staff. The development of methods for assessment of hormones and other 
biomarkers from saliva in the 1970s presented the opportunity to take PNE research out 
into the field. 

This chapter describes the use of salivary biomarkers in the context of research in 
daily life. The chapter focuses on methods of assessment of hormones and other biomark- 
ers in saliva from a practical perspective. Due to its widespread use, the presentation of 
salivary cortisol assessment methods is given priority. Salivary assessment of other hor- 
mones and biomarkers is also discussed in a short section. 


Assessment of Hormones 


Hormones can be assessed from different sources, each implying different practical procedures
and a different interpretation of the resulting hormone concentrations.


Assessment of Hormones in Blood, Urine, and Hair


Assessment of hormones from blood is the most widely used method. The procedure 
of taking blood is invasive and triggers a physiological response to venipuncture, and 
it requires medically trained staff. For these reasons blood sampling of hormones is not 
suitable for ambulatory assessment protocols. Although ambulatory assessment of hor- 
mones (and metabolites) in urine can be done, it interferes strongly with daily life. Due 
to aggregation of analytes in urine over time, the measure reflects aggregated hormone 
levels (e.g., 24-hour cortisol), which limits its usefulness for daily life research. Rela- 
tively recently, hair analysis of steroid hormones, such as testosterone and cortisol, has 
been proposed. Similar to retrospective analysis of drugs or toxins in hair, assessment of 
hair cortisol has been shown to be a promising candidate for retrospective assessment of 
aggregated cortisol production over a period of some months, whereas, at the moment, it 
seems not to be suitable for studying dynamics of everyday life using ambulatory assess- 
ment methods. 


Assessment of Hormones in Saliva 


In contrast, the assessment of hormones and other analytes in saliva provides a more 
flexible assessment tool. Because saliva collection is noninvasive and independent of spe- 
cialized medical personnel, salivary assessment of hormones and other biomarkers is 
suitable for ambulatory assessment of endocrine activity in everyday life. Using adequate 
assessment designs, saliva collection can easily be done by participants themselves in 
many different environments. Compared to laboratory assessments, such studies yield 
assessments with higher ecological validity (see Reis, Chapter 1, this volume), and com- 
bining both approaches (see F. Wilhelm, Grossman, & Miller, Chapter 12, this volume) 
might generate new insights into psychoendocrine processes. Repeated assessments can 
capture relatively short-term fluctuations in hormone concentrations, including within- 
person analysis. Combining salivary biomarker assessments with ambulatory assessment 
of psychological, other physiological, or environmental variables potentially provides 
important information about interactions of such variables in everyday life. 

Saliva can be collected in a wide variety of environments and subject groups. To 
give just a few examples, studies have implemented assessments of salivary biomarkers in 
traders on the trading floor in the City of London, in couples during couple interaction, 
in nurses and physicians at their workplace in the hospital, in students before and after 
academic examinations, and in competitive ballroom dancers on the dance floor. In addi- 
tion, salivary biomarkers can be assessed in subjects when noninvasiveness is paramount 
(e.g., children). Salivary biomarker assessments have been successfully implemented in 
ambulatory assessment studies in subjects with clinical psychological conditions (e.g., 
borderline personality disorder, major depression, and driving phobia). 


Saliva Secretion and Serum-to-Saliva Passage 


Saliva is a clear mucoserous, exocrine dilute fluid that is secreted mainly from acinar cells 
of three pairs of major glands, the parotid, submandibular, and sublingual glands, with 
a small amount being secreted from minor glands (Humphrey & Williamson, 2001). 
Most PNE studies collect whole saliva, in contrast to gland-specific saliva. Whole saliva
includes additional constituents of nonsalivary origin, such as gingival crevicular fluid, 
which leaks from the tooth-gum margin, as well as serum and blood derivatives from 
oral abrasions or lesions, food debris, and other minor constituents (Kaufman & Lam- 
ster, 2002). The secretion of saliva is controlled by sympathetic and parasympathetic 
innervations of salivary glands (Baum, 1993). 

Serum constituents that are not part of the normal salivary secretion can reach saliva 
by a variety of mechanisms. Lipid-soluble steroid hormones (e.g., cortisol, aldosterone, 
testosterone, or estradiol) diffuse rapidly through the acinar cells, and other molecules, 
such as dehydroepiandrosterone sulfate (DHEA-S), enter saliva by ultrafiltration through 
tight junctions. Larger molecules, such as protein hormones, are thought to be too large 
to reach saliva via these routes, with contamination from serum through oral wounds 
playing a major role in their detection in saliva (Kaufman & Lamster, 2002; Vining, 
McGinley, & Symons, 1983). These differences have important implications for the use 
of salivary biomarkers in ambulatory PNE (aPNE) studies. 


Saliva Collection Methods, 
Storing, and Biochemical Analysis 


Saliva can be collected by passive sampling (drooling or spitting) or by techniques based 
on absorbent collection material. 


Passive Sampling 


Passive drooling or spitting (see Rohleder & Nater, 2009, for details) has the advantage 
of sampling whole saliva, but it is likely to interfere with everyday life, and participants 
might not feel comfortable using this method, particularly in social situations. 


Using Absorbent Collection Material 


A number of saliva sampling techniques based on absorbent material have been developed.
Probably the most popular is the Salivette produced by Sarstedt (www.sarstedt.com);
a similar device is offered by Salimetrics (www.salimetrics.com). These devices
consist of a swab of absorbent synthetic material provided in a plastic cryovial or conical
tube. For sampling, the research participant opens the container, places the swab in
the mouth, waits until it is soaked with saliva (approximately 1-2 minutes), and puts it 
back into the container. Instructing the participants to chew on the Salivette facilitates 
saliva flow and quickens the process but should be used with caution depending on the 
biomarker to be assessed (see below). This method is easy to learn, is relatively clean, 
and can be employed unobtrusively even in social situations; it is therefore suitable for 
ambulatory assessment studies. 

Although Salivettes can be used for a wide variety of sampling situations, there are 
special circumstances in which alternative techniques need to be employed, for example, 
when collecting samples in infants and children. In recent years, special devices have been 
developed and tested, and are now offered commercially for use in such research situations.
For example, Salimetrics offers long synthetic swabs that can be held while in the
child’s mouth (Salimetrics Children’s Swab) or small sponges on a shaft (Sorbette) that 
have been shown to collect sufficient amounts of saliva for the assessment of biomarkers 
(de Weerth, Jansen, Vos, Maitimu, & Lentjes, 2007; Donzella, Talge, Smith, & Gunnar, 
2008). 


Stimulation of Saliva Flow 


Saliva collection can be facilitated by stimulation of saliva flow, which can be achieved 
by chewing gum, soaking Salivettes in stimulating fluids, or chewing on the Salivette. It has been
shown that stimulants can affect the concentrations of biomarkers in saliva. Relatively 
minor effects have been reported for salivary cortisol, but the use of stimulants should 
be held constant within studies (Talge, Donzella, Kryzer, Gierens, & Gunnar, 2005). 
Stimulation of saliva flow results in a change of the relative contribution of glands to 
saliva secretion, which is important for some salivary biomarkers, such as salivary alpha- 
amylase. 


Storing of Samples 


Depending on the sampling design, participants in ambulatory assessment studies might 
not have the opportunity to return their samples to the laboratory for some time. In 
addition, in most cases, saliva samples are transferred from the ambulatory assessment 
laboratory to a biochemical laboratory for analysis, and the samples might therefore be 
stored for some time. Although studies have shown that some salivary biomarkers are 
remarkably stable even when samples are stored up to some weeks at room temperature 
or under conditions that mimic mailing (e.g., Aardal & Holm, 1995; Clements & Parker, 
1998; DeCaro, 2008), this is not necessarily true for all biomarkers and sampling situ- 
ations (Whembolua, Granger, Singer, Kivlighan, & Marguin, 2006). In the absence of 
a “gold standard,” it is suggested that participants be instructed to store samples in the 
refrigerator (to prevent growth of bacteria and mold in samples), and that samples be 
frozen at -20°C or lower temperatures when stored for a longer period. Storage specifica- 
tions and freeze-thaw cycles should be kept constant within a study. 


Biochemical Analysis 


The wide variety of assays available for the assessment of biomarkers from saliva is 
beyond the scope of this chapter. Most biochemical laboratories (e.g., hospital laborato- 
ries) should be able to assess most biomarkers in saliva. A number of specialized labora- 
tories for salivary biomarker assessment (some are linked to universities; others are part 
of a commercial company) offer biomarker assessments in saliva at a relatively low cost. 


Assessment Designs 


In contrast to other physiological measures, which are often automatic and continuous, 
assessment of hormones in saliva requires the research participant to make an assessment
at distinct time points. For every ambulatory assessment study that utilizes salivary
biomarker assessment it is therefore important to define a sampling protocol that yields a 
data matrix suitable to answer the particular research question (see Conner & Lehman, 
Chapter 5, this volume). Here, the specification of a sampling protocol with regard to 
scheduling, temporal coverage, and assessment details is called an assessment design. In 
the context of aPNE studies, specific characteristics of the dynamics of hormone secre- 
tion need to be taken into account. Different assessment designs are suitable for different 
study aims. A frequently used classification of assessment designs that is useful in our 
context distinguishes between event-related and time-based designs (Shiffman, Stone, & 
Hufford, 2008). 


Event-Related Designs 


Event-related designs implement ambulatory assessments in association with a predefined 
event of interest (see Moskowitz & Sadikaj, Chapter 9, this volume). In the context of 
aPNE, this design would be useful to study relatively infrequent events or if there is an 
interest in higher sampling frequency to investigate responses to a specific event in higher 
resolution. For example, a series of assessments could be triggered by the participant 
when he or she is exposed to a particular type of stressor, such as peer rejection or a 
social conflict at work. In aPNE, such designs are often used with reference to the rela- 
tively clearly defined events of awakening or academic examinations. Because changes 
in hormones in response to an event are difficult to interpret without a baseline, event- 
related designs should usually be combined with a time-based design. 


Time-Based Designs 


In time-based designs, assessments are linked with specific time points during a period 
of time. The number of assessments per day and the number of assessment days per 
study can vary widely between studies. In PNE studies, at least three assessments per 
day should be implemented due to circadian rhythms of many hormones. In addition, 
it is recommended that the assessment design spans at least two assessment days due 
to the variability of hormone secretion within persons between days. For example, it 
has been shown that overall salivary cortisol in the first hour after awakening can be 
reliably estimated with two assessment days and four assessments per day, whereas sig- 
nificantly more assessment days are necessary for a reliable assessment of the salivary 
cortisol increase after awakening (Hellhammer et al., 2007). The choice of the number 
of assessments also is influenced by feasibility, participant burden, and cost calculations, 
particularly with large numbers of subjects (Adam & Kumari, 2009). Depending on the 
aim of the study, two different time-based designs can be implemented: fixed-occasion
and variable-occasion designs.
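
The distinction drawn above between overall cortisol in the first hour after awakening and the cortisol increase after awakening can be made concrete with trapezoidal areas under the curve. This is one common way of summarizing such profiles, offered here as an assumption rather than as the computation used in the cited study; the function names and sample values are invented for illustration.

```python
def auc_ground(times_min, values):
    """Trapezoidal area under the curve relative to zero (overall output)."""
    if len(times_min) != len(values) or len(values) < 2:
        raise ValueError("need at least two matched time/value pairs")
    area = 0.0
    for i in range(1, len(values)):
        width = times_min[i] - times_min[i - 1]
        area += width * (values[i] + values[i - 1]) / 2.0
    return area


def auc_increase(times_min, values):
    """Area relative to the first (awakening) sample, i.e., the increase."""
    total_time = times_min[-1] - times_min[0]
    return auc_ground(times_min, values) - values[0] * total_time


# Invented example: four samples at 0, 15, 30, and 60 minutes after awakening.
times = [0, 15, 30, 60]
cortisol = [10.0, 14.0, 18.0, 12.0]
overall = auc_ground(times, cortisol)
increase = auc_increase(times, cortisol)
```

Because the increase measure subtracts the awakening value, any error in that single first sample carries into the whole index, which is one intuition for why it may need more assessment days to be estimated reliably.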


FIXED-OCCASION DESIGNS 


In these designs, saliva sampling is scheduled at prespecified time points during the day. 
Due to circadian rhythms in the secretion of some hormones, this is a popular design in 
aPNE. Fixed timing is particularly important in three situations:

1. When a distinct facet of hormone secretion that uniquely occurs at a fixed time
period is at the center of the study, for example, when assessing the cortisol awak- 
ening response, which occurs up to 45 minutes after awakening (see the section 
on salivary cortisol below). 

2. When circumstances (financial resources or participant burden or availability) 
allow collection of a single saliva sample only; due to variation of hormone levels 
within a day, a single sample has to be identically timed for a meaningful com- 
parison between subjects. 

3. When individual average hormonal output is to be compared between groups of 
subjects, for example, when assessing salivary cortisol at multiple time points 
during the day. 


Again, samples need to be identically timed for all subjects to avoid unsystematic vari- 
ance in between-subject comparisons. 

Figure 11.1 illustrates this design for salivary cortisol and salivary alpha-amylase 
samples taken directly after waking up, 30 minutes and 60 minutes later, and at hourly 
intervals between 9:00 A.M. and 8:00 P.M. The figure demonstrates the cortisol awaken- 
ing response and the circadian rhythm of both salivary cortisol and alpha-amylase. 
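
A fixed-occasion schedule of this kind can be generated programmatically. The sketch below is a minimal illustration, not the procedure of the cited study: the function name and defaults are assumptions, and the rule of skipping clock-time samples that would fall at or before the last post-awakening sample is a design choice made here for the example.

```python
from datetime import datetime, timedelta


def fixed_occasion_schedule(wake_time, post_wake_minutes=(0, 30, 60),
                            first_hour=9, last_hour=20):
    """Build one day's sampling times for a fixed-occasion design.

    Samples are anchored to awakening first, then to clock time on the hour.
    Clock-time samples at or before the last post-awakening sample are skipped.
    """
    schedule = [wake_time + timedelta(minutes=m) for m in post_wake_minutes]
    day_start = wake_time.replace(hour=0, minute=0, second=0, microsecond=0)
    for hour in range(first_hour, last_hour + 1):
        sample_time = day_start + timedelta(hours=hour)
        if sample_time > schedule[-1]:
            schedule.append(sample_time)
    return schedule


# Example: a participant who wakes at 7:00 A.M.
times = fixed_occasion_schedule(datetime(2024, 1, 15, 7, 0))
```

Anchoring the first samples to each participant's own awakening, while fixing the remaining samples to clock time, keeps the awakening response and the daytime profile comparable across subjects.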


VARIABLE-OCCASION DESIGN 


In this design, assessments are scheduled at variable time points across the day. The aim 
is to achieve a representative sample of the subjects’ daily life. Although this sampling 
schedule would yield comparable aggregate measures of hormone output with a high 
number of sampling points, it cannot be expected to yield comparable aggregates for 
the numbers of samples that are usually possible to achieve in aPNE assessment stud- 
ies. Therefore, this design is recommended if within-subject associations are of primary 
importance, for example, in associations between exposure to everyday life stressors and
salivary cortisol levels. To achieve coverage of the complete day, a quasi-randomization
procedure can be used, in which randomized sampling occasions are conditionally based
on a timing criterion being met. For example, in stratified random sampling, the day is
divided into a number of segments equal to the number of sampling occasions, and a sampling
time point is then placed randomly within a segment. Stratified random sampling
ensures that within-day variability in hormonal output is adequately represented in the
dataset, and that fixed and random within-subject effects (see Nezlek, Chapter 20, this
volume) reflect associations across the full day. To avoid placing some assessments very
close in time, with the consequence of large gaps between other assessments, it is useful
to include a criterion that defines a minimum time interval between assessments within
subjects, for example, 30 minutes. Stratified random sampling is an important variant
of variable-occasion designs in aPNE when contextual or within-subject effects are of
primary interest.

FIGURE 11.1. Diurnal course of salivary cortisol and salivary alpha-amylase levels (mean ± standard
error of mean) in 76 healthy men and women. This is an example for data generated by a
time-based fixed-occasion design. From Nater, Rohleder, Schlotz, Ehlert, and Kirschbaum (2007).
Copyright 2007. Reprinted with permission from Elsevier.

Figure 11.2A illustrates this design for salivary cortisol assessments taken between 
10:00 A.M. and 8:00 P.M. using stratified random sampling, with minimum spacing of 30 
minutes. It can be seen that the individual assessments were taken at very different inter- 
vals and spread across the whole assessment period. In contrast to the design illustrated in 
Figure 11.1, it obviously would not be sensible to compute average levels at specific time 
points with this design, illustrating that variable-occasion designs are useful for testing 
within-subject effects, but less so for comparing hormone levels between subjects. 
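
Stratified random sampling with a minimum spacing criterion can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and parameters are hypothetical, times are expressed as minutes since midnight, and redrawing the whole schedule until the spacing criterion is met is only one of several ways to enforce it.

```python
import random


def stratified_schedule(start_min, end_min, n_samples, min_gap=30,
                        rng=None, max_tries=1000):
    """Draw one random time per equal-length segment, enforcing a minimum gap.

    Times are minutes since midnight. The whole schedule is redrawn until
    all neighboring samples are at least min_gap minutes apart.
    """
    if rng is None:
        rng = random.Random()
    segment = (end_min - start_min) / n_samples
    for _ in range(max_tries):
        times = [start_min + i * segment + rng.uniform(0, segment)
                 for i in range(n_samples)]
        gaps = [later - earlier for earlier, later in zip(times, times[1:])]
        if all(gap >= min_gap for gap in gaps):
            return times
    raise RuntimeError("no valid schedule found; relax min_gap or n_samples")


# Six samples between 10:00 A.M. (600 min) and 8:00 P.M. (1200 min).
schedule = stratified_schedule(600, 1200, 6, min_gap=30, rng=random.Random(1))
```

Each participant and day would receive a fresh draw, so sampling times differ between subjects while every segment of the day remains represented.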

The relative timing of event or momentary measures and salivary hormone assess- 
ments needs to be chosen carefully considering the dynamics of the biomarkers to be stud- 
ied. For example, cortisol peaks follow stressful events or negative subjective-emotional 
states with a lag of approximately 10-20 minutes (Schlotz et al., 2008), while alpha-amylase
peaks usually occur much more quickly (Nater & Rohleder, 2009).
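
When momentary reports and saliva samples are analyzed together, such lags can be respected by pairing each sample only with events that fall inside the expected window. The sketch below assumes the 10-20 minute cortisol lag mentioned above; the function name and the example day are hypothetical, and a faster-responding marker would need a shorter window that is not specified here.

```python
def lagged_pairs(event_times, sample_times, min_lag=10, max_lag=20):
    """Pair each saliva sample with events that preceded it by min_lag..max_lag.

    All times are minutes since midnight; lags are in minutes.
    """
    pairs = []
    for sample in sample_times:
        for event in event_times:
            if min_lag <= sample - event <= max_lag:
                pairs.append((event, sample))
    return pairs


# Hypothetical day: stressful events reported at 10:05 and 14:30,
# saliva samples taken at 10:20, 12:00, and 14:35.
pairs = lagged_pairs([605, 870], [620, 720, 875])
```

In this example only the 10:20 sample is paired with an event, because the 14:35 sample follows its event by just 5 minutes and would be too early to reflect a cortisol response.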

FIGURE 11.2. Individual salivary cortisol measures on one day from a time-based variable-occasion
design using stratified random sampling with six assessments between 10:00 A.M. (1000h)
and 8:00 P.M. (2000h) in 23 healthy men and women. The solid black line illustrates the mean 
trend, as calculated from the fixed effects of a mixed-effects regression model. (A) Untransformed 
observations with a trend line obtained from a second-order (i.e., quadratic) polynomial regression 
of cortisol on time of day. The mean cortisol concentration across subjects and time in this sample 
was M = 3.53, with a standard deviation of SD = 2.62. (B) Log-transformed observations with 
trend line obtained from a first-order (i.e., linear) polynomial regression (M = 1.05; SD = 0.66). 
Produced from unpublished data collected for a psychoneuroendocrine ambulatory assessment 
project at the School of Psychology, University of Southampton, United Kingdom. 

Irrespective of the specific design implemented, it is recommended to combine saliva
sampling with a device that is preprogrammed to give signals when an assessment is to 
be done, such as a handheld computer or smartphone (see Kubiak & Krog, Chapter 7, 
this volume). Such a combination has a number of benefits. First, it potentially reduces 
interference with everyday life activities and might increase validity because participants 
are less likely to adjust their day planning and activities to the sampling schedule when 
prompted by an external trigger. Second, psychological variables, situational character- 
istics, and potentially confounding variables can be assessed at the same time. Finally, 
electronic devices can be used to check for compliance with saliva sampling. 
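
As a rough illustration of how a signaling device might be preprogrammed for a time-based variable-occasion design, the following Python sketch draws one random prompt from each equal-width stratum of the assessment window. The function names, the representation of time as minutes since midnight, and the 30-minute minimum gap between prompts are assumptions made for this example; they are not taken from any specific device or study.

```python
import random


def stratified_random_schedule(start_min, end_min, n_prompts, min_gap=0.0, rng=None):
    """Draw one random prompt time (minutes since midnight) per equal-width stratum.

    min_gap is a hypothetical minimum spacing between consecutive prompts.
    """
    rng = rng or random.Random()
    width = (end_min - start_min) / n_prompts
    if min_gap > width:
        raise ValueError("min_gap must not exceed the stratum width")
    times = []
    for k in range(n_prompts):
        lo = start_min + k * width
        hi = lo + width
        if times:
            # Keep the prompt inside its stratum, but not too close to the previous one.
            lo = max(lo, times[-1] + min_gap)
        times.append(rng.uniform(lo, hi))
    return times


def to_clock(minutes):
    """Format minutes since midnight as HH:MM (seconds are dropped)."""
    m = int(minutes)
    return f"{m // 60:02d}:{m % 60:02d}"


# Six prompts between 10:00 (600 min) and 20:00 (1200 min), as in Figure 11.2.
schedule = stratified_random_schedule(600, 1200, 6, min_gap=30, rng=random.Random(1))
print([to_clock(t) for t in schedule])
```

With the window used in Figure 11.2, each stratum is 100 minutes wide, so every participant receives one prompt per stratum while the exact times remain unpredictable.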


Compliance in aPNE 


Because ambulatory assessments place a considerable burden on subjects, missing assess- 
ments are a common problem with these research methods. In aPNE, it has been demon- 
strated that missing assessments can bias the daytime profile of hormonal output, partic- 
ularly in fixed-occasion designs (Kudielka, Broderick, & Kirschbaum, 2003). Although 
such effects might be relatively small in group comparisons, within-subject analyses are 
likely to be affected more strongly. It is therefore important to include some kind of 
compliance check in the assessment procedure. A relatively accurate method is to use 
containers with electronic caps that provide an automatic time stamp at every opening of 
the container (e.g., MEMS® monitors; AARDEX Group, Sion, Switzerland) where swabs 
are stored. Subjects should be instructed to open the box only to take out a swab and not 
to take out more than one swab. Although such records cannot confirm saliva collection, 
an unopened box very likely indicates that a sample has not been taken. 
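
The logic of such a compliance check can be sketched in a few lines of Python. The sketch assumes that prompt times and cap-opening time stamps are available as minutes since midnight and that a 10-minute tolerance window is acceptable; both the data layout and the tolerance are hypothetical choices, not features of any particular monitoring system.

```python
def check_compliance(scheduled, openings, tolerance=10):
    """Match each scheduled prompt to the closest unused cap opening.

    Returns a list of (scheduled, opening, delay) tuples; opening and delay
    are None when no opening falls within the tolerance window (in minutes).
    """
    unused = sorted(openings)
    result = []
    for s in sorted(scheduled):
        candidates = [o for o in unused if abs(o - s) <= tolerance]
        if candidates:
            best = min(candidates, key=lambda o: abs(o - s))
            unused.remove(best)  # each opening can account for one sample only
            result.append((s, best, best - s))
        else:
            result.append((s, None, None))
    return result


# Hypothetical example: three prompts, the second answered 25 minutes late.
report = check_compliance([600, 720, 840], [603, 745, 838], tolerance=10)
for scheduled, opening, delay in report:
    status = "no opening in window" if opening is None else f"opened, delay {delay} min"
    print(scheduled, status)
```

A matched opening shows only that the container was opened in time, not that saliva was collected, so the output is best read as a lower bound on noncompliance.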

Because the timely compliance of subjects filling in paper diaries can be very poor 
(see Shiffman et al., 2008), electronic devices should be used wherever possible. They can 
be preprogrammed to give signals at specific time points and to accept entries only within 
a limited period. Such devices also offer a method of checking compliance by prepro- 
gramming the device to display a unique code to be copied by the subject to the container 
where the swab is to be placed after saliva collection. Although accuracy is limited, this 
method has the advantages of being less expensive than using electronic monitors and 
reducing the burden to participants. 

Adam and Kumari (2009) present a useful list of additional methods for optimizing
compliance with saliva sampling, such as providing clear instructions to participants,
facilitating communication, and using incentives for quick and complete return of
samples.

Finally, if saliva samples are to be taken with reference to awakening, electronic 
devices (e.g., the Actiwatch; CamNtech, Cambridge, UK) can be used to monitor sleep 
and verify time of awakening (Dockray, Bhattacharyya, Molloy, & Steptoe, 2008). 


Potentially Confounding Covariates 


Hormonal secretion is associated with a large number of time-constant and time-varying 
covariates. It depends on the research question whether such factors are considered to 
be confounders or variables of interest by design, and this distinction should guide the 
decision on how such variables are included in assessment design and statistical models.
During the planning phase of the study, identification of potentially relevant factors for 
the specific salivary hormone to be assessed should be based on the literature. The key 
question to be answered here is: Does a covariate potentially confound the association of 
interest? For example, in a study where within-subject associations of stress experience 
with momentary salivary cortisol levels are of primary interest, smoking is a potential 
confounder that has been shown to be associated with stress experience, as well as sali- 
vary cortisol levels. Apart from confounding associations of interest, covariates add to the 
error variance in a statistical model, hence reducing statistical power if not controlled. 

In general, the following factors are potentially relevant confounders in aPNE: circa- 
dian and ultradian rhythms of the target hormone; medical and recreational drugs; dis- 
eases; eating (meals in general and specific nutrients, such as licorice); drinking (particu- 
larly caffeine); exercise (particularly at high intensity); smoking (momentary and smoking 
status); age; gender; use of oral contraceptives; and phase of menstrual cycle. Such poten- 
tial confounding factors can be controlled by restricting the sample using exclusion cri- 
teria (which reduces the generalizability of the findings); by using specific instructions 
to prevent assessments being made in timely association with such factors (which has an 
impact on the representativeness of the assessments); or by assessing relevant factors and 
including them in the statistical model to adjust for their potential effect. 

As mentioned earlier, blood from oral abrasions can contaminate salivary hormone 
assessments, although such effects seem to differ between hormones (Kivlighan et al., 
2004). If possible, blood contamination of saliva should be avoided. While this is difficult 
to achieve in variable-occasion designs, instructions not to eat or brush teeth can be given 
in studies with fixed-occasion designs. 


Statistical Analysis 


Data Preprocessing 


Before submitting salivary hormone data to statistical analysis, they should be screened 
for outliers and distributional properties. Sometimes a dataset contains subjects with 
consistently high hormone values, which might be due to blood contamination of saliva, 
a clinical condition, or unknown factors. In addition, a small number of outliers are 
usually present within subjects, often for unknown reasons. Although such values are 
not necessarily invalid, they might heavily influence the results of the statistical analysis. 
Excluding such subjects or observations from the analysis (trimming) or setting outlying 
observations to a specific percentile of the data (Winsorization) are the most straightfor- 
ward solutions. The most widely used rule defines an outlier as any observation that is 
beyond three standard deviations from the mean of the data. With hormone data, this 
often applies to unusually high values only because the lower three standard deviations 
often include zero. For example, in the cortisol dataset shown in Figure 11.2A, two data 
points were trimmed (19.54 nanomoles per liter [nmol/L] at 1244 h [time of day] and 
16.23 nmol/L at 1946 h) before plotting the data.
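
The screening rule and the two remedies described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the three-standard-deviation rule is applied to the pooled sample, Winsorization caps only the upper tail at a nearest-rank percentile (the 95th percentile is an arbitrary choice), and all values except 19.54 nmol/L are invented for the example.

```python
import math
import statistics


def flag_outliers(values, n_sd=3.0):
    """Flag observations more than n_sd sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return [v < lower or v > upper for v in values]


def trim(values, n_sd=3.0):
    """Drop flagged observations (trimming)."""
    return [v for v, out in zip(values, flag_outliers(values, n_sd)) if not out]


def winsorize_upper(values, upper_pct=95.0):
    """Set values above the nearest-rank upper percentile to that percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(upper_pct * len(ordered) / 100))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values]


# Hypothetical cortisol values in nmol/L; only 19.54 is taken from the text.
cortisol = [2.1, 3.4, 1.8, 4.0, 2.9, 3.3, 2.5, 5.1, 1.2, 3.8,
            2.2, 4.4, 3.0, 2.7, 1.9, 3.6, 2.4, 4.8, 3.1, 19.54]

print(flag_outliers(cortisol).count(True), "observation(s) flagged")
print(len(trim(cortisol)), "observations remain after trimming")
print(max(winsorize_upper(cortisol)), "is the largest value after Winsorization")
```

Note that in very small samples no single value can lie three standard deviations from the mean, so the rule is informative only when enough observations are pooled.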

Very often, the distribution of salivary hormones is non-normal, in most cases posi- 
tively skewed. In this case, it is useful to transform data to approximate a normal distri- 
bution, using, for example, the natural logarithm (see Figure 11.2B). Such transforma- 
tions often linearize the association between hormonal levels and time of day, leading to 
a less complicated statistical model (see Figure 11.2A/11.2B for an example). In addition,
observations might not fulfill the outlier criterion after transformation (see the two high 
values in Figure 11.2B that were trimmed in Figure 11.2A). However, for an adequate 
interpretation of results, it is important to remember that model parameters do not reflect 
linear associations when the outcome was log-transformed. 
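
A short worked equation may help with this interpretation. Assuming a simple model with a single time predictor and a natural-log-transformed outcome (the symbols and the numerical slope below are illustrative and not estimates from the data in Figure 11.2), back-transformation shows that a slope on the log scale corresponds to a multiplicative, not additive, change on the original scale:

```latex
\begin{align*}
\ln(y_t) &= b_0 + b_1 t + e_t \\
y_t &= e^{b_0} \cdot e^{b_1 t} \cdot e^{e_t} \\
\frac{y_{t+1}}{y_t} &= e^{b_1}
  \quad\Rightarrow\quad
  \text{percent change per unit of } t = 100\,\bigl(e^{b_1} - 1\bigr) \\
\text{e.g., } b_1 = -0.10 \text{ per hour}
  &\;\Rightarrow\; e^{-0.10} \approx 0.905,
  \text{ i.e., a decrease of about } 9.5\% \text{ per hour}
\end{align*}
```

The ratio in the third line holds for the systematic part of the model, with the error term held constant.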


Statistical Analysis 


Many datasets have been analyzed using summary indicators of repeated salivary hor- 
mone assessments, such as the area under the curve (AUC), average levels, or mean 
increases. While such indicators provide integrated single-outcome measures that are 
more convenient to handle in statistical models, data aggregation leads to loss of infor- 
mation, and summary indicators often do not adequately deal with missing values. In 
addition, they are seldom adequate for analyzing variable-occasion designs. The method 
of choice for aPNE datasets in most cases is multilevel/mixed modeling (see Hruschka, 
Kohrt, & Worthman, 2005; Nezlek, Chapter 20, this volume). In contrast to summary 
indicators and other widely used methods, such as repeated measures analysis of vari- 
ance, multilevel/mixed models can be used to analyze within-subject associations with 
adequate modeling of heteroscedasticity and autocorrelation in the error structure, and 
they retain cases with missing values in the analysis. It is often important to first model 
circadian rhythm by including a time predictor in the model. When data were collected 
on more than 1 day, three-level models (observations within days within subjects) could 
be used (see Note 2). Examples of studies that used multilevel/mixed models are presented in the
section on salivary cortisol below. Structural equation modeling often provides
complementary methods of analyzing such datasets (see Eid, Courvoisier, & Lischetzke, Chapter 21,
this volume, and Kumari et al., 2010).
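
For readers unfamiliar with the notation, a three-level model of this kind can be written out as follows. This is a minimal sketch with random intercepts only; the log-transformed outcome, the time-of-day predictor, and the momentary stress predictor are hypothetical choices for illustration, and real applications often add random slopes and a structured error covariance.

```latex
\begin{align*}
% Level 1: observation t on day d within subject s
\ln(\mathrm{cortisol}_{tds}) &= \pi_{0ds} + \pi_{1}\,\mathrm{time}_{tds}
    + \pi_{2}\,\mathrm{stress}_{tds} + e_{tds} \\
% Level 2: day d within subject s
\pi_{0ds} &= \beta_{00s} + r_{0ds} \\
% Level 3: subject s
\beta_{00s} &= \gamma_{000} + u_{00s} \\
% Combined model
\ln(\mathrm{cortisol}_{tds}) &= \gamma_{000} + \pi_{1}\,\mathrm{time}_{tds}
    + \pi_{2}\,\mathrm{stress}_{tds} + u_{00s} + r_{0ds} + e_{tds}
\end{align*}
```

Here the time term models the circadian trend, while the three residual terms separate variability between subjects, between days within subjects, and between observations within days.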

Researchers should make sure that their design provides for adequate sample size in 
terms of participant numbers, numbers of days, and observations within days per partici- 
pant to yield sufficient power to be able to detect the effects of interest. 


Salivary Cortisol 


The glucocorticoid cortisol is the end product of the hypothalamus-pituitary-adrenal 
(HPA) axis. It is secreted from the adrenal gland in response to a signal cascade that 
starts with secretion of corticotropin-releasing hormone (CRH) from the hypothalamus, 
which then triggers secretion of adrenocorticotropic hormone (ACTH) from the pitu- 
itary. ACTH binds at its receptors in the adrenal cortex and triggers the secretion of cor- 
tisol into the blood circulation. Cortisol is a steroid hormone that enters saliva by passive 
diffusion, and its concentration in saliva is not dependent on saliva flow rate. Salivary 
cortisol is closely correlated with plasma cortisol. Different (non-cotton-based) methods 
of saliva collection do not seem to have an influence on salivary cortisol levels (Shirtcliff, 
Granger, Schwartz, & Curran, 2001). Because most of the cortisol in blood is bound to 
transport proteins (mostly corticosteroid-binding globulin, CBG) that do not enter saliva, 
salivary cortisol reflects the unbound cortisol fraction. This free cortisol, but not bound 
cortisol, elicits effects at target tissues (Kirschbaum & Hellhammer, 1994). Cortisol binds 
at two types of receptors that are widely distributed across organ systems, including the 
brain. The assessment of cortisol levels is relevant for psychological research for a
number of reasons. First, HPA axis activity is influenced by various brain areas, particularly
the limbic system (Ulrich-Lai & Herman, 2009), which suggests an indicator function of 
cortisol levels for psychological processes. Second, cortisol influences immunological and 
metabolic processes, which suggests a potential mediating mechanism for associations 
between psychological and disease processes. Third, cortisol enters the brain and binds 
at receptors in different brain areas, which suggests that psychological processes can be 
influenced by cortisol levels. 

However, salivary cortisol levels do not entirely reflect activity of the HPA axis at 
hypothalamic or pituitary levels (Hellhammer, Wüst, & Kudielka, 2009), and therefore
cannot be expected to correlate perfectly with brain activity in relevant areas. Such physi- 
ological and other dissociations have to be taken into account when salivary cortisol is 
studied in association with psychological processes (cf. Schlotz et al., 2008). 


Covariates 


Numerous covariates of salivary cortisol assessed in daily life have been identified and 
should be considered when planning a study. Important time-varying factors are smoking, 
alcohol use, caffeine, exercise, and meals. Figure 11.1 illustrates the potential influence 
of a covariate if it occurs at a similar time across subjects, here a postprandial cortisol 
increase at 2:00 P.M. Factors that are usually constant over the sampling period include 
gender, age, socioeconomic status, physical and mental health, medication and oral con- 
traceptive use, and menstrual cycle phase. Important exclusion criteria are steroid-based 
medications, endocrine disorders, late pregnancy, and acute illness during the testing 
period (see Adam & Kumari, 2009, for references). 


Characteristics and Indicators of Cortisol Output 


Cortisol output is characterized by a circadian pattern, with highest levels in the morning 
and lowest levels around midnight (see Figure 11.1 for an illustration of a typical diurnal 
cortisol rhythm). For this reason, time of sampling must be recorded with every cortisol 
sample and included in the statistical model as a covariate, if time is not held strictly 
constant across subjects. Due to the link with awakening, it has been argued by some 
researchers that time of collection should always be measured with reference to awaken- 
ing. However, this makes little difference if participants wake up in the morning and the 
fixed assessments are done more than 1 hour after awakening. 

Cortisol also shows an ultradian pattern that reflects pulsatile secretion (Lightman 
& Conway-Campbell, 2010). Whereas ultradian rhythms can be detected only with very 
frequent sampling, the circadian rhythm of salivary cortisol levels can and should be 
included in statistical models, even with a low sampling frequency. 

Numerous indicators of cortisol output have been used in ambulatory salivary cor- 
tisol studies. While single-time-point assessments are of limited value, indicators derived 
from repeated assessments include characterizations of the cortisol awakening response 
(CAR; see Figure 11.1 for an illustration; see Chida & Steptoe, 2009, for a meta-analysis 
on relevant psychosocial factors), the diurnal slope, total output as measured by the AUC 
(cf. Adam & Kumari, 2009), and average momentary within-subject associations of cor- 
tisol with other variables. Which aspect of cortisol output is to be sampled and which 
indicator used in a specific study depends on the research aim, financial resources
available, and the burden to the participants. A suitable assessment design must be chosen
accordingly (see the section on assessment designs). 
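
Two of the summary indicators mentioned above, total output as AUC and the diurnal slope, can be computed as in the following Python sketch. It assumes one common variant of each: the AUC is taken with respect to zero using the trapezoidal rule, and the slope is an ordinary least squares slope of cortisol on time. The function names and the sample values are hypothetical.

```python
def auc_ground(times, values):
    """Trapezoidal area under the curve with respect to zero."""
    pairs = sorted(zip(times, values))
    return sum((t1 - t0) * (v0 + v1) / 2
               for (t0, v0), (t1, v1) in zip(pairs, pairs[1:]))


def diurnal_slope(times, values):
    """Ordinary least squares slope of values regressed on times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    sxx = sum((t - mean_t) ** 2 for t in times)
    return sxy / sxx


# Hypothetical day for one participant: hours since awakening, cortisol in nmol/L.
hours = [1.0, 4.0, 8.0, 12.0, 15.0]
cortisol = [12.0, 7.5, 5.0, 3.2, 2.1]

print("AUC:", auc_ground(hours, cortisol))
print("Slope per hour:", diurnal_slope(hours, cortisol))
```

Both functions simply skip over any assessment that is absent from the input, which changes the meaning of the result; this is one reason why such indicators handle missing values poorly.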


Ambulatory Salivary Cortisol and Health 


Cortisol interacts with physiological processes in many organ systems (Sapolsky, Romero, 
& Munck, 2000), even after relatively small changes in cortisol (Plat et al., 1999). A 
number of ambulatory assessment studies have demonstrated associations of salivary 
cortisol in daily life with health and disease status and potential mechanisms. Examples 
are associations between the CAR and major depression (Adam et al., 2010), as well as 
wound healing (Ebrecht et al., 2004), and between diurnal cortisol decline and coronary 
calcification (Matthews, Schwartz, Cohen, & Seeman, 2006). 


Ambulatory Salivary Cortisol and Stress 


The HPA axis, one of the major stress response systems (Chrousos, 2009), provides an 
important mechanistic link between stress and disease. For these reasons, salivary corti- 
sol has been most frequently studied in association with stress. Besides laboratory-based 
research (Dickerson & Kemeny, 2004), a wide range of methods has been used to study 
such associations in daily life. A large group of studies used indicators of facets of cortisol 
output in daily life, such as CAR, AUC, or mean cortisol levels, and found mostly posi- 
tive associations with retrospective measures of stress (e.g., Schlotz, Hellhammer, Schulz, 
& Stone, 2004; Steptoe, Siegrist, Kirschbaum, & Marmot, 2004). Such associations are 
thought to reflect long-term adaptation of the organism to stress and have been shown to 
be relevant for a variety of health risk factors (Rosmond, Dallman, & Björntorp, 1998).

However, such studies rely on potentially biased retrospective reports, and they pro- 
vide no information on the within-subject covariance between stress or stress-related 
variables and cortisol output. Although most relevant for the dynamics of daily life stress 
and its potential impact on health, few studies have used a within-subject design (Hanson, 
Maas, Meijman, & Godaert, 2000; Jacobs et al., 2007; Peeters, Nicolson, & Berkhof,
2003; Schlotz, Schulz, Hellhammer, Stone, & Hellhammer, 2006; Smyth et al., 1998; 
van Eck, Berkhof, Nicolson, & Sulon, 1996). All of these studies found positive within- 
subject associations between daily life stress or negative mood and salivary cortisol, and 
the study by Jacobs and colleagues (2007) suggests that negative mood mediates
stress-cortisol associations in daily life. These results are promising and suggest that applying
these designs to the study of related research questions might yield valuable insights into 
psychoneuroendocrine interactions in daily life. 


Other Salivary Hormones 


DHEA/DHEA-S 


Dehydroepiandrosterone (DHEA) and its sulfated metabolite (DHEA-S) are steroid hor- 
mones secreted from the adrenal cortex in response to ACTH. DHEA is normally secreted 
synchronously with cortisol, although dissociations are common (see Kroboth, Salek,
Pittenger, Fabian, & Frye, 1999, for a review). DHEA/DHEA-S levels in blood show diurnal
variation and associations with age and gender, increase with exercise, and have been
associated with several mental and physical disorders. Salivary DHEA correlates highly 
with DHEA in serum (Granger, Schwartz, Booth, Curran, & Zakaria, 1999). DHEA-S 
concentrations in saliva depend on saliva flow rate (Vining et al., 1983). Salivary DHEA, 
but not DHEA-S, seems to be dependent on the collection method (i.e., stimulated or 
unstimulated swab or passive drool) (Gallagher, Leitch, Massey, McAllister-Williams, 
& Young, 2006; Shirtcliff et al., 2001; Whetzel & Klein, 2010). A few studies have 
investigated associations between DHEA/DHEA-S and psychological factors in daily 
life, with often inconsistent findings. Although salivary DHEA levels have been shown to 
be responsive to psychosocial laboratory stress (Izawa et al., 2008), no study has investi- 
gated associations with daily life stress. 


Testosterone and Other Sex Steroid Hormones 


The lipid-soluble steroid hormones testosterone and estradiol or estriol enter saliva by 
passive diffusion through acinar cells and are independent of flow rate (Vining et al., 
1983). Whereas salivary testosterone is closely correlated with the free hormone frac- 
tion in serum (Granger, Schwartz, Booth, & Arentz, 1999), correlations for estradiol are 
age- and gender-dependent (Shirtcliff et al., 2000). Saliva collection must not be done by 
cotton-based methods (Gröschl & Rauh, 2006; Shirtcliff et al., 2001), and other
precautions should be followed (Granger, Shirtcliff, Booth, Kivlighan, & Schwartz, 2004). With
regard to psychoendocrine interactions, salivary testosterone has been studied mostly in 
relation to motivational situations, competition, and aggression (not reviewed here). A 
number of other sex hormones that can be assessed in saliva are not discussed here. 


Oxytocin 


Attempts have been made to measure larger molecules in saliva, such as oxytocin, which 
would be an interesting hormone in aPNE due to its correlations with social behav- 
ior. However, such molecules do not passively enter saliva. Therefore, salivary oxytocin 
concentrations are very low, often below detectable values (Horvat-Gordon, Granger, 
Schwartz, Nelson, & Kivlighan, 2005), and it is likely that they are heavily dependent 
on blood contamination of saliva samples. Although successful measurement of salivary 
oxytocin has been reported recently using an enzyme immunoassay (Carter et al., 2007), 
more research is needed to understand better the validity and utility of such measures. 


Other Salivary Biomarkers 


Saliva contains a large number of molecules with potential as biomarkers. Some non- 
hormonal salivary biomarkers are briefly presented here, selected on the basis of their 
potential usefulness for daily life research. 


Alpha-Amylase 


In recent years, the measurement of salivary alpha-amylase (sAA) has become increasingly
popular in aPNE. Alpha-amylase, an enzyme secreted with saliva, is thought to reflect sympathetic
nervous system activity (Nater & Rohleder, 2009). Due to its mechanism of secretion
into saliva, specific methodological considerations have to be followed to achieve valid 
assessments of sAA (Rohleder & Nater, 2009). Figure 11.1 illustrates diurnal variability 
in sAA, and several other factors also need to be controlled when this salivary biomarker 
is used for research in daily life. Although increased sAA levels have been observed in 
response to laboratory stress (see Nater & Rohleder, 2009, for a review), more research 
is needed into within-subject associations in daily life. 


Immunological Markers 


Salivary immunoglobulin A (sIgA) is a marker of mucosal immunity. It can be assessed 
with (non-cotton-based) swabs (Strazdins et al., 2005) but is dependent on saliva flow rate. 
Reduced sIgA levels have been reported in association with stress and negative mood in daily 
life (Deinzer & Schüller, 1998; Stone, Cox, Valdimarsdottir, Jandorf, & Neale, 1987).


Cotinine 


Cotinine, the primary metabolite of nicotine, can be assessed in saliva (Shirtcliff et al., 
2001) and has been successfully used as a biomarker for objective assessment of aggre- 
gated exposure to tobacco smoke (Granger et al., 2007). 


Summary and Recommendations 


In summary, the measurement of hormones in saliva is a useful method for studying 
psychoneuroendocrine processes in daily life. Salivary cortisol is the most widely used 
method due to its relatively straightforward measurement and its relevance for psycholog- 
ical and physiological processes. A number of methodological issues need to be taken into 
account when designing a study. The method and timing of saliva sampling and adequate 
handling of potentially confounding factors are of particular importance. 


Notes 


1. It should be noted that cotton seems to interfere with salivary assessment results for a number of
biomarkers (Gröschl & Rauh, 2006; Harmon, Hibel, Rumyantseva, & Granger, 2007; Shirtcliff et
al., 2001). Because synthetic materials are readily available and show better recovery rates for many
biomarkers, cotton-based devices are best avoided.

2. Although such data could in principle be (and sometimes have been) modeled with a two-level model 
by ignoring the day level, three-level models are adequate representations of the data structure. They 
offer the advantage of modeling between-day variability and increase the power of detecting within- 
subject associations by reducing error variance. 
CHAPTER 12


Bridging the Gap between 
the Laboratory and the Real World 


Integrative Ambulatory Psychophysiology 
PAUL GROSSMAN
MAREN I. MÜLLER


In this chapter, we briefly introduce the scientific discipline of psychophysiology before
we give an overview of psychophysiological assessment in real life. We then contrast the
paradigms of ambulatory assessment and field experimentation with the very common 
laboratory experimentation. We consider the differential benefits of these approaches, 
as well as their unique pitfalls and difficulties, and argue for use of laboratory and 
field approaches in conjunction, as they are fundamentally complementary research 
approaches (Patry, 1982). We point out the necessity of (1) more frequent use of ambula- 
tory approaches and (2) development of new, combined research strategies in order to 
gain data and insight from different angles—strategies that are validated in the labora- 
tory and close to daily life conditions at the same time (Fahrenberg, Myrtek, Pawlik, & 
Perrez, 2007). Finally, methods and instruments of ambulatory physiological measure- 
ment are presented. Difficulties of data collection and interpretation that are unique to 
each method are discussed, and possibilities to obviate them are presented. 


Psychophysiology as a Research Discipline 


Psychophysiology is a branch of psychology concerned with the physiological bases of 
psychological processes and endeavors to understand interactions of these processes 
using noninvasive methods, primarily in humans. Psychophysiology is positioned at the 
intersection of psychological and medical science, and its popularity and importance 
have expanded with the increasing realization of a pervasive interrelatedness of mind 
and body. Assessment of psychophysiological functioning opens unique ways to
understand perceptual, emotional, and cognitive processes, and provides insights that cannot
be obtained by interview, questionnaire, or behavioral observation. However, self-reports 
(see Gunthert & Wenze, Chapter 8, and Moskowitz & Sadikaj, Chapter 9, this volume) 
and behavioral measures (see Mehl & Robbins, Chapter 10; Bussmann & Ebner-Priemer, 
Chapter 13, and Goodwin, Chapter 14, this volume) provide an important context for 
understanding complex psychophysiological responses and general organismic function- 
ing, especially outside the laboratory. They are often obtained in conjunction with physi- 
ological data (Wilhelm, Schneider, & Friedman, 2005). 

Historically, psychophysiologists first targeted autonomic functioning. About 150 
years ago they started investigating pulse, respiration, and electrodermal activity, then 
blood pressure, heart activity, and brain waves in response to emotion-eliciting or cog- 
nitive tasks. More recently scientific interest has shifted, and psychophysiologists today 
are equally or even more interested in the central nervous system. They explore cortical 
brain activity using methods such as brain waves (electroencephalography [EEG]), event- 
related potentials (ERPs), and brain scans (utilizing advanced neuroimaging methods 
and technologies; e.g., functional magnetic resonance imaging [fMRI]). The application 
spectrum for psychophysiological assessment is broad and ranges from medical fields, 
such as cardiology, chronobiology, and psychosomatic and behavioral medicine, to stress 
research, sleep research, and the psychological disciplines of clinical psychology, social 
psychology, and work and organizational psychology. Research topics are, among oth- 
ers, the relation between physiological and psychological aspects of neuronal activity and 
behavior during attentional processes, sensory stimulation, cognitive tasks, emotions, 
stress, pain, and learning processes (e.g., conditioning, habituation, and extinction). 

A variety of predominantly noninvasive measurement techniques exist to capture
measures of different physiological systems, such as central nervous system activity
(electroencephalography; magnetoencephalography [MEG]; magnetic resonance imaging
[MRI]), cardiovascular activity (electrocardiography [ECG]; blood pressure [BP]; pulse 
volume amplitude [PVA]), respiration (pneumography, capnography), skin temperature, 
electrodermal activity (skin conductance level [SCL] or response [SCR]), muscle activ- 
ity (electromyography [EMG]; accelerometry), oculomotor activity (electrooculography 
[EOG]; pupillometry), and speaking activity. The determination of salivary cortisol or 
alpha-amylase levels as indices of stress (hypothalamic-pituitary-adrenal and sympatho- 
adrenal axes) and hormonal or immunological analyses of blood samples are further pro- 
cedures that complement physiological, behavioral, or self-report measures (for details, 
see Schlotz, Chapter 11, this volume). In psychophysiological research, diverse
physiological systems are typically recorded simultaneously by means of multichannel monitors
and polygraphs. 


Psychophysiological Assessment in Real Life 


Ambulatory assessment and ambulatory monitoring represent rather recent research 
approaches that offer interesting new opportunities for studying daily life activities 
outside the laboratory or hospital. Ambulatory assessment denotes the “acquisition of 
psychological data and/or physiological measures in everyday life (i.e., natural settings) 
according to an explicit assessment strategy which relates data, theoretical constructs, 
and empirical criteria specific to the given research issue” (Fahrenberg, 1996, p. 4).
Processes that are continuously recorded or systematically observed with a view to intervening
for corrective action are referred to as monitoring. Monitoring can be either stationary 
(bedside/“wired”) or ambulatory, with the latter enabling subjects to move freely and 
pursue their daily business while being assessed. 

About 30 years ago, portable microcomputers first became available. Holter (1961) 
developed an ambulatory ECG monitor; a few years later, ambulatory devices for semi- 
automatic and automatic noninvasive registration of arterial blood pressure (Schneider, 
1968), for EEG recording (Ives & Woods, 1975), and for sleep recording (Wilkinson, 
Herbert, & Branton, 1973) came into use. Initially, the technology found its application 
in medicine, wherein it has progressed most rapidly. The practical benefit is evident: It 
serves to monitor physiological symptoms and bodily functions, such as cardiac activity, 
blood pressure, and physical activity patterns in daily life (e.g., Bussmann & Stam, 1998; 
Haynes & Yoshioka, 2007; Verdecchia, Angeli, & Gattobigio, 2004), to obtain infor- 
mation relating to diagnoses of patients at risk (e.g., with cardiovascular diseases), to 
assess the effects of diverse medical procedures, to evaluate progress in treatment, and
to monitor medication effects. For many years, 24-hour monitoring of ECG and blood 
pressure have been indispensable routine methods in clinical diagnostics and treatment. 
More recently, electronic diaries (see Gunthert & Wenze, Chapter 8, this volume) and 
other forms of ambulatory “patient-reported outcome” data have been used increasingly 
in clinical trials, with the intention to overcome retrospective bias and other sources of 
measurement error in the evaluation of treatments (e.g., Garcia et al., 2007). 

Later, psychological scientists also began employing portable recording systems to 
capture relevant physiological processes in the context of real life and its manifold natu- 
ral stimuli. Often, they specifically targeted the major biological, emotional, and stress 
response systems, including the sympathetic-adrenal-medullary system (many cardio- 
vascular measures; e.g., blood pressure, preejection period, pulse wave amplitude, elec- 
trodermal activity), the vagal system (a few cardiovascular measures; e.g., respiratory 
sinus arrhythmia [RSA]), the respiratory system (many measures of respiratory pattern, 
some measures of gas exchange), and the hypothalamic-pituitary-adrenal axis (cortisol). 
Recently, tools for the naturalistic observation of daily acoustic environments have also 
been used. The registered speaking activities and ambient noise that provide information 
on social context, social interactions, and individual habits may, for example, give clues 
about changes in mood in patients with affective disorders (Mehl & Holleran, 2007; see 
Mehl & Robbins, Chapter 10, this volume). 

Modern portable electronic recording devices can duplicate the typical set of chan- 
nels employed in a psychophysiological research laboratory and are suited for practical 
application in a wide range of fields. In clinical research and practice, ambulatory techniques
are appropriate for all cases in which patients exhibit significant pathological
symptoms that cannot be detected as reliably in a laboratory or hospital as during
prolonged observation in everyday life (Fahrenberg, 1996). These cases include ventricular
arrhythmia and ischemic episodes, epilepsy, sleep disorders, and hypertension.

Processes related to circadian rhythm can also be investigated by employing ambu- 
latory strategies. In sleep research, for example, measures of actigraphy, EEG, EOG, 
and EMG are employed to infer sleep efficiency and sleep staging in the home setting 
of patients. The temporal fluctuations of autonomic and respiratory regulations are tar- 
geted in chronomedicine, where, for example, information on the circadian variation of 
hypertension can help in titrating medication for maximal potency. Endocrinological
regulation can be reflected in daily profiles of salivary cortisol. Social interaction can, at
least indirectly, be deduced from recordings of breathing patterns associated with speak- 
ing activity. 

Within the disciplines of clinical and occupational psychology are a variety of relevant
health- and work-related issues that call for assessment in real life. Ambulatory assessment
or monitoring may be expedient for people in hazardous workplaces or for patients
with certain diseases, such as chronic pain or panic disorder (for an overview of applica- 
tions in clinical psychology and psychiatry, see Wilhelm & Perrez, 2008). A considerable 
number of clinical studies, as well as studies on work stress and strain, already have 
employed the ambulatory paradigm to investigate the association of physiological and 
emotional parameters outside the laboratory (e.g., Conrad, Wilhelm, Roth, Spiegel, & 
Taylor, 2008; Ebner-Priemer et al., 2008; Wilhelm & Roth, 1997, 1998a). 

Since psychophysiological theories and research methods have already penetrated 
various fields of psychology and neighboring sciences, and because easy-to-use, high- 
technology tools for ambulatory assessment have become available, ambulatory psy- 
chophysiological approaches now fruitfully complement classical laboratory experimen- 
tation in various fields of research. 


Lab versus Real Life: Integrating Disparate Approaches 


Currently, many psychological and psychophysiological issues can be investigated in a
tractable way only in the constrained settings of a laboratory; however, there are clearly
phenomena that require examination under real-life conditions. For example, psycho- 
logical factors in cardiac arrhythmias and epileptic seizures, mood and fatigue, stress in 
the workplace, and emotions in the family context should, for a number of reasons, be 
investigated outside a laboratory or clinic setting. For this reason, ambulatory assessment 
strategies are indicated. 

Although technological innovations now permit detailed study of psychophysiologi- 
cal phenomena in naturalistic settings, ambulatory strategies until recently have seldom 
been employed. Thus, a huge proportion of scientific knowledge on human physiology, 
experience, and behavior is based on laboratory research or inferred from retrospec- 
tive reports. In the past, these strategies were the most feasible—if not almost the only 
ones—and they still possess distinct advantages. The experimental laboratory approach 
is characterized by constrained conditions and can provide systematic and reliable infor- 
mation on functional relations between particular stimuli and elicited reactions. It allows 
for systematic variation of independent variables (IVs) in order to identify the effects on 
dependent variables (DVs); thus, it is particularly good at addressing research questions
on causes and mechanisms. 

The experiment remains the “gold standard” for concise testing of hypotheses under 
the most stringent possible methodological isolation of the phenomenon in question 
(Fahrenberg et al., 2007). By repeating experiments in exactly the same fashion sev- 
eral times, reproducibility of results can be evaluated. Through continuous supervision, 
compliance of participants with experimental instructions is ensured. Moreover, physi- 
ological measures contain relatively few artifacts from participant movement or technical 
problems. Psychophysiological laboratory research is relatively inexpensive in terms of 
both technology and analysis time. Questionnaires and interviews can complement the
laboratory approach well. Placing little burden on subjects, they can be applied quickly
and cost-effectively, and provide information on individual experience, self-observed 
behavior, and characteristics of personality, attitude, emotions and cognitions that may 
be relevant for understanding biobehavioral phenomena. 

There are, however, limitations of laboratory psychophysiological research, which 
can be demonstrated by data from our recent mental stress study. Heart rate (HR), which 
is not only an indicator of physical exertion but also a good index of intensity of mental 
stress and anxiety, was measured with a LifeShirt (Vivometrics, now Vivonoetics, Inc.)
system (Wilhelm, Roth, & Sackner, 2003) during a laboratory protocol that comprised 
five stressors with increasing mental task demands over time, including reading, an audi- 
tory attention task, a memory comparison reaction time task, simulated public speaking, 
and a computer math task. Following this morning laboratory session, participants went 
off to follow their normal daily activities. We controlled HR data for physical activity 
levels. Figure 12.1 depicts the course of HR in the laboratory and in real life, adjusted 
for physical activity, for a 27-year-old female participant. The brief, task-related peaks 
seen during the “stress” conditions were typically small, on the order of 5-10 bpm (beats 
per minute), and were clearly offset in significance by the effects of sitting still from 
9:00-11:00 A.M., which reduced HR by about 20 bpm. Interestingly, despite enhanced 
mental stress, HR actually diminished across time (possibly reflecting habituation). More
important, however, when an emotional stimulus specifically relevant to the participant, 
in this case, a game of her national soccer team on TV, was broadcast at home, her HR 
was elevated remarkably (up to 50 bpm) for over an hour despite her semisupine sitting 
position. Another example of remarkably elevated HR in a clinical assessment in real life 
is provided in Figure 12.2. These examples demonstrate that real-life stress can be much
more intense than that observed in the laboratory, and they cast doubt on the assumption
that laboratory stress results can be generalized to real-life stress.
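
The text does not specify how HR was adjusted for physical activity. One simple way to make such an adjustment, offered here only as an illustrative sketch, is to regress HR on an accelerometer-derived activity level and keep the part of HR that activity does not explain. The function name `adjust_hr_for_activity`, the linear model, and the sample numbers are assumptions made for this example; they are not the procedure used in the study.

```python
from statistics import mean


def adjust_hr_for_activity(hr, activity):
    """Return HR with the linear effect of activity removed (hypothetical helper).

    Fits hr = a + b * activity by ordinary least squares and re-expresses each
    HR sample as the value expected at the mean activity level.
    """
    if len(hr) != len(activity) or len(hr) < 2:
        raise ValueError("hr and activity must be equally long, with >= 2 samples")
    mean_hr, mean_act = mean(hr), mean(activity)
    sxx = sum((a - mean_act) ** 2 for a in activity)
    if sxx == 0:
        return list(hr)  # activity never varies, so there is nothing to remove
    sxy = sum((a - mean_act) * (h - mean_hr) for a, h in zip(activity, hr))
    slope = sxy / sxx  # bpm per unit of activity
    return [h - slope * (a - mean_act) for a, h in zip(activity, hr)]


# Made-up 1-minute averages: HR in bpm, activity in arbitrary accelerometer units.
hr = [62, 75, 90, 70, 110]
activity = [0.0, 1.0, 2.0, 0.5, 0.2]
adjusted = adjust_hr_for_activity(hr, activity)
```

In this made-up series the last sample remains the highest after adjustment, because its HR is high although activity is low. That is the kind of pattern an emotionally relevant event at rest would produce.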

FIGURE 12.1. Heart rate responses (1-minute averages) of a study participant monitored with the
LifeShirt during and after a laboratory stress protocol consisting of five resting baselines and
mild-to-moderate mental stressors (“Lab”). Laboratory stress responses and responses to the movie
were small compared to responses to a soccer game the participant watched at home (*). Note.
Heart rate was adjusted for ongoing physical activity. From Wilhelm and Grossman (2010).
Copyright 2010. Reprinted with permission from Elsevier.

FIGURE 12.2. Annotated ambulatory LifeShirt recording of patient with a diagnosis of social
phobia before, during, and after a 10-minute business meeting presentation. A small subset of
recorded and analyzed parameters is displayed: M%RC, minute-by-minute median of rib cage
contribution to tidal volume; Vent, breath-by-breath continuous estimate of minute ventilation;
HR, beat-by-beat continuous estimate of heart rate; AccM, motion component of accelerometry.
From Wilhelm and Grossman (2010). Copyright 2010. Reprinted with permission from Elsevier.

In fact, there is growing concern about low external validity of laboratory findings.
One simply cannot be sure of the relevance of laboratory assessment without comparison
with data from real life. An increasing number of studies consequently show that
ambulatory strategies can add crucial value for understanding real-life behavior—which 
undoubtedly is the declared aim of the corresponding sciences. The increasing availability 
of alternative high-technology solutions makes ambulatory assessment a desirable option 
for a variety of research questions (Grossman, 2004; Maheu, Pulier, Wilhelm, McMe- 
namin, & Brown-Connolly, 2005; Wilhelm & Roth, 1996). It is less constrained than 
laboratory strategies. In addition, by providing naturalistic data from assessments made
during everyday life, enabling long-term recording, and accounting for person-situation
interactions, it has the potential to overcome many validity problems of laboratory research.


Ambulatory Assessment Provides Naturalistic Data 


Though this is not always evident at first sight, researchers are likely to miss important aspects
of psychophysiological functioning that may be central to life outside the laboratory when 
they rely exclusively on laboratory data. In the medical literature the effect of “white 
coat hypertension” is an instructive example, illustrating that conclusions drawn from 
office-based blood pressure readings can be misleading or even dangerous because some 
patients demonstrate high blood pressure in the clinic but not at home. Consequently, 
although they are not hypertensive in everyday life, they may receive medication (see 
Pickering, 1991).

In ambulatory research, physiological processes are recorded while the subject pursues
normal daily activities. Laboratories and clinics, in contrast, are unfamiliar
places, and their potentially threatening nature can interfere with the natural unfolding 
of the processes in question. Because of subjects’ emotional reactivity to the setting, base- 
line measurements in a laboratory or clinic may not closely represent real-world function- 
ing. Several studies have demonstrated that laboratory-to-life generalizability of blood 
pressure levels during baseline periods or stress reactions is rather modest (Kamarck, 
Schwartz, Janicki, Shiffman, & Raynor, 2003). 

Nevertheless, the most basic paradigm is the resting baseline, and a majority of 
clinical studies, as well as diagnostic procedures (e.g., blood pressure and blood draws), 
take place in laboratories, physicians’ offices, or hospitals. Physiological registrations are 
acquired during sedentary baseline periods characterized by low levels of motor activity. 
These immobile conditions are tacitly assumed to reflect quiet rest in daily life, and any 
abnormality seen there is consequently interpreted as evidence of a diagnostic feature, 
a pervasive trait, or a constitutional abnormality. This can result in overestimation of 
pathological features and poses a serious problem related to validity in some disorders 
(Wilhelm & Roth, 2001). 

In research, scientific assessment during supposed baseline “resting” is often imme- 
diately followed by stressful procedures or tasks as part of the study protocol. It seems 
obvious that certain subjects, such as patients with affective and anxiety disorders, are 
more prone than others to demonstrate symptoms of anxiety and stress in reaction to 
the strange and potentially intimidating environment of a laboratory, on the one hand, 
and the upcoming procedures to which patients anticipate personal vulnerability, on the 
other. Thus, it is not surprising that prechallenge baseline values indicating anxiety often 
are systematically elevated compared to postchallenge values (Alpers, Wilhelm, & Roth, 
2005; Wilhelm & Roth, 1998b). 

The anticipation of impending tasks or procedures, often empirically selected to 
elicit undesired emotional states, certainly cannot be considered a baseline resting state. 
Nor may the reactions of patients constrained to perform certain tasks in a laboratory be 
representative of normal function—acute anxiety episodes elicited in the laboratory may 
or may not represent what is going on in real life. And a brief sedentary measurement 
period cannot be employed as a neutral baseline either—even when it is not followed by 
a challenging task, given likely interindividual differences in perception of the laboratory, 
especially between people with and without psychological disorders. During everyday 
life, on the other hand, patients may often experience neutral and calm states under 
conditions perceived as secure and safe. Conversely, healthy individuals certainly also 
show extreme emotional reactions to specific stimuli they find stressful or frightening. 
Hence, elevated physiological or behavioral indices of anxiety or other emotions
may often merely be an artifact of the laboratory situation.

Presumably, it is particularly the inherent artificiality of assessment conditions that 
places the validity of laboratory measurement at risk. A multitude of factors likely keep 
physiological processes from unfolding naturally, including the mere presence of specific 
and alien technical apparatuses in proximity to the subject, the actual instrumentation of 
study participants for purposes of physiological registration (e.g., being placed in an fMRI
tube or tethered to an ECG), the unusual requirement of remaining almost completely 
immobile for at least several minutes (often during situations in which physical activity
might be a natural and adaptive response), the awareness of being continuously observed,
and/or the content and presentation of experimental instructions. 

In basic research, psychophysiology is mostly used in reactivity paradigms, in which 
a stimulus, a task, or a psychologically meaningful situation is presented, and reactions 
to it are quantified. A multitude of stimulus presentation and behavioral challenge para- 
digms have been developed and are well standardized (e.g., a videoclip to induce a certain 
emotion, or spiders to induce anxiety in phobic patients). 

Such stimuli and elicitation techniques may work reliably and induce important 
aspects of genuine experience and behavior—evidenced, for example, by the self-report 
of participants and objective indicators, such as crying during sadness or physiological 
changes during phobic reactions. However, one has to keep in mind that these stimuli 
were created artificially and implemented in the laboratory context, and often they are 
presented in a caricature-like manner that effectively trades off aspects of real life for 
experimental control. Whether the elicited behaviors are identical to real ones is ques- 
tionable. Our mental stress example has demonstrated already that laboratory stressors 
are not necessarily comparable to stressors in real life—either in quantity or in quality of 
the reactions they elicit. The artificiality of stimuli, their lack of representativeness, and 
their insufficient relevance to the individual can compromise external validity of labora- 
tory research. 

At this point the complementary use of laboratory and ambulatory approaches can 
be pertinent for modeling real-life stimuli for representative and more meaningful labora- 
tory research designs. Whereas it may appear relatively easy to find eligible stimuli (e.g., 
spiders in research on arachnophobia) for some issues, it is considerably more difficult 
and challenging to find others. Randomly sampling stimuli from the environments of 
individuals can help to model the types of tasks people face in their daily lives. Using the 
example of tunnel phobia, cross-fertilizing effects between the paradigms can be illus- 
trated. In order to find out which specific aspects of a tunnel transit really do elicit anxi- 
ety, it may be a reasonable strategy to send patients through a tunnel (or better, several 
tunnels featuring different constitutional characteristics; e.g., length, width, height, illu- 
mination, speed limit), in the course of a field diagnostic test, and assess their perceptions 
consecutively. The collected data can provide useful information and help to create labo- 
ratory stimuli in which environmental properties are preserved, and ultimately increase 
ecological validity of laboratory research findings, an approach Brunswik (1955) has 
recommended to cognitive psychologists. Moreover, and no less important, the gained 
insights may in turn prove useful in the clinical context, enhancing diagnosis and treat- 
ment. 

Many (healthy) subjects, in contrast, are very aware that the experimental
situation—regardless of how representative, relevant, and close to real-life the labora- 
tory setting may be—is not in fact real, and that an actual threat is proscribed by ethi- 
cal requirements placed on the researcher. In addition, the mere awareness or knowl- 
edge of being continuously observed can be problematic. One could argue that just as 
in quantum physics, where methods of quantification influence the characteristics of 
what is being observed (i.e., particles vs. waves; Greiner, 2001), experimental laboratory 
research employs methods that may profoundly alter what researchers are trying to com- 
prehend (i.e., human behavior as it commonly occurs outside the laboratory). Thus, even 
the most sophisticated experimental design and measurement may not accurately reflect
psychological and physiological processes during daily life, although they may precisely
characterize what takes place in a rarified realm that sometimes bears little, or only 
modestly overlapping, resemblance to ordinary situations in the real world. Ambulatory 
approaches, in turn, being less burdensome and enabling long-term recordings, have the 
potential to overcome what might be termed a “psychological uncertainty principle” and 
enable assessment without changing what is being assessed (cf. Heisenberg’s uncertainty 
principle in physics). Thus, the degree of stability and representativeness of psychophysi- 
ological reaction patterns in real life can be determined. 

Psychology and neighboring sciences urgently need to develop clear and well-designed 
research programs that employ laboratory strategies in conjunction with real-life assess- 
ments. The traditional sequence of scientific inquiry in clinical psychology and psychiatry 
commonly begins with a patient’s self-report of a phenomenon that is not yet researched 
sufficiently (e.g., a panic attack). Then, questionnaires are employed, interviews are con- 
ducted, and experts’ opinions are obtained to judge the relevance and relation of the 
self-report to a certain diagnosis. Finally, experimental laboratory assessment paradigms 
are utilized to search for causes and mechanisms. We propose an alternative sequence, 
in which ambulatory assessment follows the patient’s self-report to validate subjective
and retrospective descriptions, and to gather additional information mainly on
antecedents, consequences, and context of the phenomenon (e.g., the panic attack), and
(optionally) also on cues of potential causes and psychophysiological mechanisms. On
this sound footing, ecologically valid laboratory paradigms with more focused hypotheses
can then be developed.

This alternative course of action also has an economic rationale. Instead of investing 
large amounts of time and money in the replication of laboratory experiments to confirm
results whose generalizability is doubtful, more effort should be focused on the external 
validation of early laboratory findings by putting positions and evidence derived from 
laboratory assessment to a “real-life” test. 

Ambulatory approaches have the potential to demonstrate the inadequacy of specific 
laboratory findings and challenge particular theories based on laboratory research. Con- 
sequently, ambulatory strategies may elaborate or modify theoretical models in impor- 
tant ways. For example, this has been the case for theories on the origin of panic attacks. 
Symptomatic and physiological respiratory alterations in laboratory panic induction 
studies have been found repeatedly; therefore, biological theories have postulated a pri- 
mary role of respiratory factors (Klein, 1993; Ley, 1985). However, ambulatory studies 
have shown that genuine panic attacks in everyday life are often much less severe than
what is observed in the laboratory, in terms of both experienced symptoms and
physiological changes (Garssen, Buikhuisen, & van Dyck, 1996; Margraf, Taylor, Ehlers, 
Roth, & Agras, 1987). This fact has subsequently strengthened cognitive accounts of 
panic (see Roth, Wilhelm, & Pettit, 2005). Additionally, very recent ambulatory evidence 
seriously challenges the idea that respiratory factors play a major role at all in panic disor- 
der (Pfaltz, Grossman, Michael, Margraf, & Wilhelm, 2010; Pfaltz, Michael, Grossman, 
Blechert, & Wilhelm, 2009). 

This example also demonstrates how deficient our scientific knowledge about real-life
behavior actually is. Because ambulatory strategies have rarely been used so far, we
actually know little about human behavior in terms of psychophysiological functioning
(and malfunctioning) in the multifaceted situations of everyday life. To date, ambulatory
psychophysiological research has largely been applied to clinical issues, where it has
indeed improved understanding of a number of disorders. Yet a good portion of human
behavior is still a mystery; thus, it is time to also utilize the ambulatory paradigm to 
explore genuine human psychological and physiological functioning. For example, we 
know little about how people behave and what they do in terms of respiratory function- 
ing when they encounter the multiple challenges of everyday life. Long-term assessment 
of physiological activities in situ can provide normative data on known phenomena, and 
it has the potential to identify and characterize new phenomena. 


Ambulatory Assessment Enables Long-Term Recording 


The extended temporal observation window afforded by ambulatory assessment is well
suited to examining phenomena so rare that they are unlikely to happen spontaneously
in a laboratory. Especially in the clinical context, a greater time scale can be of particular 
importance when it comes to assessing certain pathological symptoms, such as epileptic 
seizures or panic attacks. Likewise, the 24-hour ambulatory monitoring of ECG, also 
called “Holter monitoring,” is very useful for observing occasional cardiac irregularities
that would be difficult to identify in a shorter period of time. Technologically, much
progress has been made to enhance real-time analyses of essential ECG changes, such as
ST-segment depression and arrhythmias (Fahrenberg, 1996).

The extended time window of real-life assessment also enables investigation of 
circadian profiles of activation. In sleep research, for example, nocturnal measures of 
actigraphy, EEG, EOG, and EMG are employed to infer sleep efficiency and sleep stag- 
ing. In chronomedicine, the temporal fluctuations of autonomic regulations are targeted 
because information on the circadian variation of outcomes such as hypertension can 
help in titrating medication for maximal potency. And endocrinological regulation can 
be reflected in daily profiles of salivary cortisol. 


Ambulatory Assessment Can Account for Person-Situation Interactions 


A key characteristic of ambulatory research is repeated sampling of individuals across 
diverse situations and over prolonged periods of time. It enables within-person analyses 
(see also Hamaker, Chapter 3, this volume) and allows capture of person-situation inter- 
actions. Thus, contextual dependencies of psychophysiological reactivity measures can 
be investigated, and within-person stability of individual reactivity in everyday life can 
be inferred. Furthermore, long-term measurements render possible the disentangling of 
the effects of trait, temperament, or diagnosis from state-trait interaction unique to the 
laboratory (Kraemer, Gullion, Rush, Frank, & Kupfer, 1994; Pfaltz et al., 2009). Alter- 
natively, averaging over multiple data points in various situational contexts is possible, 
leading to a greater reliability and generalizability of the assessment. This approach is 
well suited for idiographic issues largely neglected by laboratory research, in which
data averaged across individuals within a single task or occasion can easily mask under- 
lying differences in individual processes. In contrast to single-occasion group designs, 
ambulatory multi-occasion measurement yields a more representative sample of diverse 
states of an organism and potential environmental determinants thereof. Many scientific 
insights have, in fact, been derived from a small number of intensively studied individuals 
(Garmezy, 1982). Thus, even small-N ambulatory studies may significantly expand current
knowledge on psychophysiological functioning (for further discussion and examples
of innovative idiographic approaches on cardiovascular reactivity, see Friedman &
Santucci, 2003; Grossman, Karemaker, & Wieling, 1991).
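
The point that data pooled across individuals can mask within-person processes can be illustrated with a small numerical sketch. The two hypothetical participants, their self-rated stress scores, and their HR values below are invented for illustration, and the helper names `pearson` and `center_within_person` are assumptions of this example rather than part of any published analysis.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def center_within_person(values_by_person):
    """Subtract each person's own mean, then concatenate all persons."""
    centered = []
    for values in values_by_person.values():
        person_mean = mean(values)
        centered.extend(v - person_mean for v in values)
    return centered


# Invented repeated measurements for two hypothetical participants.
stress = {"A": [1, 2, 3], "B": [6, 7, 8]}
hr = {"A": [80, 82, 84], "B": [60, 62, 64]}

# Pooled analysis ignores who contributed which observation.
pooled_r = pearson(
    [v for person in stress for v in stress[person]],
    [v for person in hr for v in hr[person]],
)

# Within-person analysis removes stable between-person differences first.
within_r = pearson(center_within_person(stress), center_within_person(hr))
```

In this constructed example the pooled correlation is negative although HR rises with stress within each participant, so the within-person correlation is positive. Only repeated sampling of the same individuals makes the second analysis possible.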

There are large differences in the range of situations people encounter in their lives, 
partly because of differences in their personalities (Ickes, Snyder, & Garcia, 1997). 
Patients who suffer from the clinical syndrome of agoraphobia, for example, avoid cer- 
tain places and situations that would otherwise elicit intense anxiety. It is also increas- 
ingly recognized that among healthy populations, people regulate their emotions in daily 
life by situation selection (Gross & Thompson, 2007). Whereas individuals in the real 
world (and during ambulatory assessment) can seek and avoid situations freely, and in 
accordance with their natural preferences, conditions in a laboratory are (ideally ran- 
domly) prescribed to the study participants. The laboratory research paradigm is often 
unable to tap into this important source of information on emotion regulation strategies. 
On the other hand, avoidance of certain negative emotion-eliciting situations in real life 
may make it difficult to study relevant emotion responses. In a later section on typical 
problems and solutions for real-life assessment, we present ambulatory study designs that 
circumvent this problem. 


Psychophysiological Functions 
That Can Be Measured in Real Life 


In an attempt to understand interactions of psychological and physiological processes in 
everyday life, psychophysiological ambulatory research is concerned with individual reac- 
tivity to specific (real-life) stimuli and differences in baseline functioning. Measures of dif- 
ferent physiological subsystems are targeted, such as the sympathetic-adrenal-medullary 
system, the vagal system, the respiratory system, or the hypothalamic-pituitary-adrenal 
axis. An increasing number of inexpensive yet sophisticated portable measurement sys- 
tems is available, and diverse task-specific software solutions for data processing and anal- 
ysis exist (Ebner-Priemer & Kubiak, 2007; Myrtek, Aschenbrenner, & Brugner, 2005). 
Table 12.1 provides an overview of channels that can be recorded, variables that can be 
analyzed based on the recorded biosignals, and available devices that contain these chan- 
nels (though not necessarily the analysis software to obtain the specific variables). 
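
Several of the analyzed variables listed in Table 12.1 are simple arithmetic functions of the recorded signals. The following sketch shows four of them: heart rate from RR intervals, heart rate variability as the standard deviation of an HR series, cardiac output as stroke volume multiplied by heart rate, and minute volume as respiratory rate multiplied by tidal volume. The function names and the units (milliseconds, mL, L) are assumptions chosen for this example, and the use of the sample standard deviation is likewise an assumption; the table itself notes that spectral power is the better HRV measure.

```python
from statistics import stdev


def heart_rate_from_rr(rr_ms):
    """Convert RR intervals in milliseconds to beat-by-beat heart rate in bpm."""
    return [60000.0 / rr for rr in rr_ms]


def hrv_sd(hr_bpm):
    """Heart rate variability as the sample standard deviation of an HR series."""
    return stdev(hr_bpm)


def cardiac_output(stroke_volume_ml, hr_bpm):
    """Cardiac output in L/min: stroke volume (mL) multiplied by heart rate (bpm)."""
    return stroke_volume_ml * hr_bpm / 1000.0


def minute_volume(resp_rate_per_min, tidal_volume_l):
    """Minute volume in L/min: respiratory rate multiplied by tidal volume (L)."""
    return resp_rate_per_min * tidal_volume_l


# Invented RR intervals (ms) for illustration only.
rr = [1000, 800, 1000, 750]
hr = heart_rate_from_rr(rr)
variability = hrv_sd(hr)
```

Variables such as respiratory sinus arrhythmia or stroke volume from impedance cardiography require more elaborate signal processing and are not covered by this sketch.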

ECG and accelerometry channels have been available for ambulatory recording for
many years and have frequently been employed in ambulatory studies. These physiological signals
are easy to obtain, relatively free of artifacts, and rather straightforward to analyze. 
Other ambulatory channels have been employed less frequently in field studies, because 
they are relatively novel (e.g., end-tidal pCO2 capnography), more difficult to obtain (e.g.,
impedance cardiography [ICG]), and/or quite susceptible to movement or other artifacts 
(e.g., EDA, pulse plethysmography, ICG, EEG). 

TABLE 12.1. Biosignals That Can Be Obtained Outside the Laboratory with Modern
Recording or Sampling Devices

Each entry gives the biosignal (abbreviation), its recording/measurement, and devices,
followed by the analyzed variables (abbreviation) and their measurement.

Electrocardiogram (ECG)
  Recording/measurement: Electrodes
  Device: Vitaport (TEMEC Instruments, Inc., Kerkrade, Netherlands), Varioport (Becker
    Meditec, Inc., Karlsruhe, Germany), Nexus (Mind Media, Inc., Roermond-Herten,
    Netherlands), VU-AMS (University of Amsterdam, Amsterdam, Netherlands), HeartMan
    (Heartbalance, Inc., Graz, Austria), SOMNOscreen (SOMNOmedics, Inc., Randersacker,
    Germany), Bioharness (BIOPAC Inc., Goleta, CA, USA)
  • Heart rate (or heart period) (HR/HP): RR intervals
  • Heart rate variability (HRV): Standard deviation within HR time series, better spectral
    power within distinct frequency bands (e.g., 0.10 Hz)
  • Respiratory sinus arrhythmia (RSA): RSA parameters (e.g., peak-valley-amplitude) or HF
    (high frequency) spectral power (best with procedurally or statistically controlled
    respiration)
  • Type and number of arrhythmias; T-wave amplitude (TWA); ST segment (ST): Specialized
    ECG quantifications, best by using multichannel ECG

Pulse plethysmography (PPG)
  Recording/measurement: Photoelectric sensor at finger or earlobe
  Device: Vitaport, Varioport, Nexus, SOMNOscreen
  • Heart rate (HR): RR intervals
  • Pulse volume amplitude (PVA): Systolic minus diastolic pulse wave
  • Pulse wave velocity (PWV): Time from beginning of cardiac contraction to maximum of
    peripheral pulse
  • Oxygen saturation: Oxygen content in arterial blood

Arterial blood pressure
  Recording/measurement: Device with upper arm cuff (oscillometric), finger cuff (Penaz
    method)
  Device: Mobil-O-Graph (I.E.M. Inc., Stolberg, Germany), TensioDay (TensioMed, Ltd.,
    Budapest, Hungary), Portapres (FMS, Inc., Amsterdam, Netherlands)
  • Systolic blood pressure: Upper pressure of pulse wave
  • Diastolic blood pressure: Lower pressure of pulse wave

Impedance cardiography
  Recording/measurement: Pairwise electrodes at neck and thorax
  Device: VU-AMS
  • Preejection period: Time from ECG Q-wave to opening of aortic valve
  • Left-ventricular ejection time: Time from opening to closing of aortic valve
  • Stroke volume: Calculated by Kubicek formula
  • Cardiac output: SV multiplied by HR

Respiration pattern
  Recording/measurement: Band(s) for inductive plethysmography
  Device: Inductotrace (Ambulatory Monitoring, Inc., Ardsley, NY, USA), Vitaport,
    Varioport, SOMNOscreen
  • Respiratory rate: Breath cycles per minute
  • Tidal volume: Depth of breathing
  • Minute volume: RR multiplied by Vt
  • Respiratory variability: Breath-by-breath variability of duration or depth of breathing

Capnogram (pCO2)
  Recording/measurement: Cannula at the nostril; or diffusion sensor at lower arm or
    earlobe
  Device: Capnocount (Weinmann, Inc., Hamburg, Germany)

Electrodermal activity (EDA)
  Recording/measurement: Electrodes at palm of hand
  Device: Vitaport, Varioport, VU-AMS
  • Skin conductance level (SCL): Mean value
  • Skin conductance response (SCR): Amplitude of stimulus-related reaction
  • Nonspecific fluctuation (NSF): Number of spontaneous SCRs/minute

Temperature (TEMP)
  Device: Vitaport, Varioport, iButton (Maxim, Inc., Sunnyvale, CA, USA), CorTemp (HQ,
    Inc., Palmetto, FL, USA)
  • Peripheral temperature: Surface sensor (e.g., at hand or foot)
  • Core body temperature: Rectal sensor

Electromyogram (EMG)
  Recording/measurement: Skin electrodes above muscles
  Device: Vitaport, Varioport, SOMNOscreen
  • Facial expression muscles: Face (e.g., corrugator supercilii)
  • Eyeblink reflex: Underneath eye (orbicularis oculi)
  • Motility (MOT): At upper thigh, for example
  • Tremor: At lower arm, for example

Accelerometry (actigraphy) (ACC)
  Recording/measurement: Calibrated piezoresistive sensor
  Device: Actiwatch (CamNtech, Inc., Cambridge, UK), Motionlogger (Ambulatory
    Monitoring, Inc., Ardsley, NY, USA), Vitaport, Varioport, SOMNOscreen, Bioharness
  • Motility (MOT): At upper thigh or trunk, for example
  • Posture (POS): At upper thigh (standing) and trunk (upright vs. supine)
  • Tremor: At hand, for example

Electrooculogram (EOG)
  Recording/measurement: Electrodes at head
  Device: Vitaport, Varioport, SOMNOscreen

Electroencephalogram (EEG)
  Recording/measurement: Head electrodes in standard positions
  Device: Vitaport, Varioport, SOMNOscreen
  • EEG-alpha, etc.: Amplitude of spectral frequencies

Salivary sampling
  Recording/measurement: Chewing on cotton swab
  Device: Salivette (Sarstedt, Inc., Newton, NC, USA)
  • Cortisol (CORT): Concentration

For the registration of several signals in different combinations, multichannel devices
exist. They are often highly flexible with regard to the number and range of sensors
obtained simultaneously and allow a comprehensive assessment of various physiological
systems during day and night (Wilhelm, Pfaltz, & Grossman, 2006; Wilhelm et al., 2003).
Laboratory assessment of chronobiological processes is very burdensome for subjects;
thus, the development of new ambulatory systems, capable of duplicating the full
laboratory polysomnography setup, is deemed a quantum leap. Capturing detailed records of
EEG, EOG, EMG, ECG, respiration, activity, and light, these devices enable detailed
sleep-stage scoring that once was available only if one spent the night in a sleep research
clinic. Modern devices have advanced analysis capabilities for efficient data reduction, 
and built-in sensors help to provide contextual information (e.g., level of physical activity,
posture, speaking).


A variety of easy-to-use systems are on the market. Simple wristwatch HR monitors 
(e.g., from Polar, Inc. and Timex, Inc.) or actigraphs (e.g., from Ambulatory Monitoring, 
Inc., and Cambridge Neurotechnology, Inc.) are basic tools for tracking HR and physical 
activity conveniently outside the laboratory. Offering high temporal resolution, actigraphs
can estimate sleep-wake cycles, sleep efficiency, daytime physical activity level, and
circadian rhythm with reasonable accuracy (Teicher, 1995). Accurate multichannel portable
monitors exist for the assessment of ICG parameters (Goedhart, Kupper, Willemsen, 
Boomsma, & de Geus, 2006), HR variability (Grossman, Wilhelm, & Spoerle, 2004), 
calibrated temporal and volumetric respiratory measures (Grossman, Spoerle, & Wil- 
helm, 2006), or electrodermal activity (Wilhelm & Roth, 1998b). 

Data on physiological processes and movement patterns may be supplemented by 
endocrine measures via special saliva collection swabs, collected over days or weeks and 
stored in small plastic containers (Salivettes) in the refrigerator. The samples can then be 
sent to the laboratory for analysis of concentration of the stress hormone cortisol and/or 
other relatively stable biological compounds (e.g., Giese-Davis et al., 2006). 

With some systems, data not only are stored internally for later download but also 
can be wirelessly transmitted in real time to a computer for display and analysis. In
competitive sports, athletes (e.g., runners) and their coaches have long made use of such
telemetric capacities, for example, to develop and adapt training plans
and evaluate improvements in performance. Telemetric capacities may also prove useful
for assessment or even targeted treatment in clinical applications. Some wrist
actigraphs already have a built-in biofeedback application, intended, for example, for 
operant treatment of attention-deficit/hyperactivity disorder (Tryon, Tryon, Kazlausky, 
Gruen, & Swanson, 2006). These devices beep when the activity level surpasses defined 
thresholds of intensity and duration to remind the patient to calm down. Similar operant 
interventions may be designed for children with impulsivity problems. 
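
As a rough illustration of such a threshold rule, the following Python sketch returns the epochs at which a device of this kind would beep. The function name, the epoch-based representation of activity, and the example values are assumptions made for this illustration; they are not taken from the devices described by Tryon et al. (2006).

```python
def beep_epochs(activity, intensity_threshold, min_epochs):
    """Return the indices of epochs at which a feedback beep would be triggered.

    activity            -- activity level per epoch (hypothetical units)
    intensity_threshold -- level that must be exceeded (assumed value)
    min_epochs          -- number of consecutive epochs above threshold
                           required before the beep (assumed value)
    """
    if min_epochs < 1:
        raise ValueError("min_epochs must be at least 1")
    beeps = []
    run_length = 0  # consecutive epochs above the intensity threshold so far
    for index, level in enumerate(activity):
        run_length = run_length + 1 if level > intensity_threshold else 0
        if run_length == min_epochs:
            # Beep once per episode, at the moment the duration criterion is met.
            beeps.append(index)
    return beeps


if __name__ == "__main__":
    print(beep_epochs([1, 5, 6, 7, 2, 8, 9, 9, 9], intensity_threshold=4, min_epochs=3))
```

Beeping only once per episode, rather than on every epoch above threshold, is a design choice of this sketch; a real device may behave differently.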

A number of remarkable technical developments have been achieved, with much 
innovation coming from the German-speaking countries and the Netherlands (see Soci- 
ety for Ambulatory Assessment at www.ambulatory-assessment.org). Because current 
trends reflect a continuous miniaturization of physiological recording systems, future 
developments will presumably comprise a variety of wearable devices, integrated, for 
example, in clothing or jewelry. Embedded sensors will be able to communicate wire- 
lessly or via on-body networks, and could transmit the registered data online to hospitals 
and research centers for monitoring purposes. Conceivable also are implantable or swal- 
lowable real-life monitoring and intervention devices, which, comparable to present-day 
cardiac pacemakers, operate directly in the body (e.g., a core temperature sensor that 
transmits data wirelessly to a main unit worn on the body). 

Physiological monitors and their user interfaces may in the future become so nonintrusive
and useful that they are likely to become part of everyday life, initially for people with
certain conditions (e.g., acute diabetes, epilepsy, cardiac diseases), then among early
adopters (e.g., athletes), and finally
for general consumer use. Critics are concerned that real-life assessment strategies are
more prone to violating privacy than other approaches, for example, by assessing sub- 
jects who engage in unpredictable situations in which data assessment may be unwanted. 
Such issues need to be clarified before study participants give their informed consent, 
and research policy should provide recommendations for how to proceed in problematic 
cases. In general, however, the acceptance of ambulatory assessment appears to be much 
higher than researchers and clinicians initially expected (Hageböck, 1994). Many stud- 
ies, of course, deal with students who get paid for their participation, or with patients 
who may benefit from the application of ambulatory methods (e.g., monitoring via ECG 
or measurement of blood pressure) directly in terms of improved diagnostics or custom- 
ized treatment. Acceptance and compliance may be different among other populations 
and for other environments, such as the workplace (Fahrenberg, 1996).

The general consumer also has evolved toward technology. For many people, mobile
phones and other personal data systems have become an integral part of their daily and 
social lives. They “got used to sharing” (David Lammers-Meis, 2010, as cited in Wolf, 
2010), and the Internet gives them a multitude of opportunities to do so, from social 
networks, such as Facebook, to specific Web forums for all conceivable interests. For 
example, on MedHelp, one of the largest Internet forums for health information, users 
launch more than 30,000 new personal tracking projects every month. Detailed records 
of sleep, exercise, alertness, productivity, sex, food, mood, or location are being tracked, 
measured, shared, and displayed voluntarily. Social media have made it seem normal to 
share everything—eventually this perceived normality might pave the way for further 
developments in real-life assessment, even though issues of data privacy, biocompatibility, 
and a potential “cyborg” evolutionary advantage will certainly have to be addressed. 


Psychophysiological Assessment in Real Life: 
Typical Problems and Solutions 


Psychophysiological assessment in real life presents various challenges. The amount of 
data can be staggering, and even the most sophisticated physiological measures lack 
meaning when they are not related to the conditions under which they were collected. 
Without contextual information, the interpretation of physiological data can be difficult 
or even impossible. Independent of specific psychological effects (e.g., emotions),
individual differences in physiological functioning derive from a number of factors,
including genetics, general health, medication usage, physical fitness level, time-of-day
effects, and momentary physical activity (discussed in Wilhelm & Roth, 2001). In addition,
physiological reactions are, of course, highly responsive to emotional stimuli, stressful
tasks, or social engagement.

These issues can be dealt with by taking contextual information into account when 
designing an ambulatory study. Due to the quantity and variety of stimuli experienced in 
a real-world environment, this can be a challenging task and, unfortunately, the literature 
that comprises ambulatory investigations is relatively sparse and recent. Designing an 
ambulatory study is not an easy task. On the one hand, ambulatory research designs are 
highly flexible and can thereby address very different types of research questions. On the 
other hand, it is the investigators who have to make a range of decisions for a particular 
research question, including decisions on the type and number of channels recorded (rang- 
ing from one to over 20), assessment duration (ranging from minutes to months), sampling 
mode (ranging from continuous monitoring to time- or event-dependent sampling), degree 
of situational context awareness (ranging from no context information to broad context 
awareness), and naturalness (ranging from fully naturalistic to fully structured assess- 
ment protocols). The range of choices within these relatively independent dimensions and 
the resulting multitude of combinations of options provide a large number of different 
study designs. And because validity, interpretability, and conclusiveness of results largely 
depend on the choices made, researchers should consider the existing possibilities very 
thoroughly. The following section presents a catalogue of issues and possible solutions 
that we hope can help researchers develop sound strategies for ambulatory investigations 
and make the best possible decisions when planning and executing a study.


Naturalness


Naturalness and the associated face validity are distinct advantages of ambulatory study 
designs and perhaps the main argument for leaving the laboratory. In fact, many ambula- 
tory studies have been designed to capture psychophysiological functioning under real- 
life conditions, and the corresponding study designs can range from fully naturalistic to 
fully structured assessment protocols. 

An unstructured or “free running” assessment protocol (e.g., 24-hour monitoring of
patients with panic disorder; Margraf et al., 1987) provides data on the frequency and
intensity of naturally occurring target events (e.g., panic attacks) over the course of time. 
However, this approach carries a risk of getting only a few examples of the target event, 
reducing statistical power for its analysis. If the main aim of the study is to assess the 
processes involved in these target events, it might be a proper strategy to screen partici- 
pants for frequency of occurrence first, then examine only those subjects who frequently 
present the target phenomenon. A limitation of the approach, of course, is that a selection 
bias is created; thus, findings can be generalized only to such extreme subgroups. 

In so-called “semistructured” ambulatory studies, most assessments take place under 
naturalistic conditions. However, in order to trigger relevant events or behaviors in a
naturalistic context, certain activities or situations outside the laboratory are prescribed
to participants (e.g., entering a shopping mall for patients with agoraphobia). This way, some
of the control of laboratory studies is preserved, yet normal daily life routine and context 
are only minimally distorted. An important variant of this approach is long-term ambula- 
tory assessment that includes one or several ambulatory baseline measurements, during 
which subjects are asked to sit alone in a room for a few minutes without speaking or 
performing any other activity. It is essential to conduct these baseline recordings in a stan- 
dardized way. Because any distortion may influence results, physical activity, speaking, 
telephone ringing, presence or appearance of other persons, ambient noise, and so forth 
should be actively minimized or noted by the participant to exclude contaminated baseline 
recordings from the analysis. Whereas laboratory baseline findings, for reasons explained 
earlier, may be compromised and often lack representativeness and external validity, thor- 
oughly performed ambulatory baseline measurements in fact allow the evaluation of real- 
life functioning. By means of multiple ambulatory baselines, the degree to which a certain 
psychological or physiological feature is a trait or a state also can be examined (Kraemer 
et al., 1994): Whereas a trait-like feature (e.g., a psychophysiological endophenotype) is 
rather constant across time, a state-like feature should vary considerably across multiple 
baselines. Furthermore, with baseline assessments from both the laboratory and the real 
world available, the question of laboratory-to-life generalizability can be addressed. 
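
One common way to express how trait-like a measure is across repeated baselines is an intraclass correlation: the share of total variance that is due to stable differences between persons. The Python sketch below computes a one-way intraclass correlation from equally many baselines per participant. It is offered only as an illustration under these assumptions and is not presented as the procedure of Kraemer et al. (1994); the function name and the example data are hypothetical.

```python
def intraclass_correlation(baselines):
    """One-way random-effects intraclass correlation, ICC(1).

    baselines -- list with one inner list of baseline values per participant;
                 every participant must contribute the same number (k >= 2).
    Values near 1 suggest a trait-like feature; values near 0 (or below)
    suggest a state-like feature that varies across baselines.
    """
    n = len(baselines)
    k = len(baselines[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in baselines):
        raise ValueError("need >= 2 participants, each with the same number (>= 2) of baselines")

    grand_mean = sum(sum(row) for row in baselines) / (n * k)
    person_means = [sum(row) / k for row in baselines]

    # Mean squares between and within participants (one-way ANOVA).
    ms_between = k * sum((m - grand_mean) ** 2 for m in person_means) / (n - 1)
    ms_within = sum(
        (value - m) ** 2 for row, m in zip(baselines, person_means) for value in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variance in the data")
    return (ms_between - ms_within) / denominator


if __name__ == "__main__":
    # Hypothetical resting HR values: three participants, three baselines each.
    print(intraclass_correlation([[60, 62, 61], [70, 69, 71], [80, 81, 79]]))
```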

In a fully structured protocol, also termed a field study, the degree of experimental 
control can be comparable to that of a laboratory study. Participants are randomly allo- 
cated to different conditions (e.g., situations or treatments) or to variations of settings 
under real-life conditions. The investigator typically controls the independent variables 
either by instructing participants to engage in certain defined activities or arranging spe- 
cific situations that participants encounter. Again, the conditions require standardiza- 
tion. All participants should encounter the same situations in the same order, with the 
same duration. Consequently, data are especially easy to analyze and unbiased. 

The Stanford Flying Phobia Study is an example of a field experiment that examined
the psychophysiological reactions of patients to a real flight situation (Wilhelm & Roth,
1996, 1998b). The structured study protocol involved transportation to the San Francisco
Airport and a commercial flight on a small airplane unaccompanied by the experimenter. 
Participants wore a multichannel device and used a diary to record their subjective emo- 
tional reactions to specific segments of the flight (e.g., boarding, taxiing, and take-off). 
Physiological reactions were later analyzed and averaged for these segments. 

Such procedures permit standardized stimulus-based averaging within each individ- 
ual, as well as group statistics across individuals in a fully balanced design that compares 
phobic patients with healthy controls. Results indicated that the intensity of anxiety reac- 
tions was quite high in phobic participants, in both subjective and physiological domains. 
This degree of elicitation of phobic reactions could not have been achieved in a laboratory 
setting, even with realistic flight simulations (e.g., in virtual reality). 


Number and Type of Channels 


The decision on number and type of channels recorded is largely determined by the 
research question (or more precisely, by the dependent and control variables deemed rel- 
evant), the availability of sensors and recording devices, and, most important, consid- 
eration of subject burden. In conjunction with the assessment of physiological activity, 
ambulatory studies typically employ at least an electronic diary, self-reported experien- 
tial variables, and observations (e.g., emotional experience, bodily symptoms, or specific 
overt behaviors, such as smoking or partner behaviors). Electronic diaries can provide 
several sources of information about various constructs. Often they are utilized to record 
setting, but they also allow implementation of cognitive tests (e.g., attention or memory 
tests examining reaction times in response to different items), that, depending on the 
research question, contribute useful information. Still, psychophysiological research can 
benefit significantly from additional control channels and sensors that supplement subjec- 
tive diary data by providing important objective contextual information for data inter- 
pretation, and an additional perspective on response patterning. 

The decision on the number of channels recorded in a physiological recording device 
(as well as the number of items queried in an e-diary and the number of additional devices 
employed) should be based on some form of cost-benefit analysis. The general rule is: 
Employ as many measures as necessary and as few as possible. Aside from the cost of
equipment involved, any added channel or item increases subject burden (the more so, the
longer the recording lasts) and needs to be justified by a significant gain of
information. Some channels (e.g., accelerometer, skin temperature) are less burdensome
than others (e.g., end-tidal pCO2, blood glucose monitor).

For certain very limited research or diagnostic questions, recording only one channel 
may be sufficient. Typically, however, diverse channels are employed in psychophysiologi- 
cal research. Devices that incorporate several relevant channels into a fully integrated 
unit are the optimal solution for multichannel studies, since they make recording easier 
and reduce subject burden. The same data could alternatively be obtained by employing 
several devices (e.g., a physiological recorder, an accelerometer, and an electronic diary). 
In this case, the concurrent recordings must be synchronized by equalizing the internal 
clocks at the beginning of the recordings (e.g., by starting them with the same PC) and 
making sure they stay synchronized over the course of the entire assessment (e.g., by 
pressing a marker button or stopping them at the same time, then comparing clock dif- 
ferences between the devices). 
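
The clock comparison described above can be turned into a simple correction. The following Python sketch maps timestamps from a secondary device onto the clock of a reference device, assuming that the clock difference was measured at a shared marker at the start and at the end of the recording and that drift accumulates linearly in between. The function name and the example values are hypothetical.

```python
def drift_corrected_time(t_device, start_offset, end_offset, duration):
    """Map a secondary-device timestamp onto the reference clock.

    t_device     -- seconds since the first shared marker, on the secondary device
    start_offset -- secondary minus reference clock at the first marker (s)
    end_offset   -- secondary minus reference clock at the last marker (s)
    duration     -- seconds between the two markers, on the secondary device
    Assumes that the drift between the two clocks is linear over the recording.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    drift_per_second = (end_offset - start_offset) / duration
    offset_now = start_offset + drift_per_second * t_device
    return t_device - offset_now


if __name__ == "__main__":
    # Hypothetical case: clocks equalized at start, 4 s apart after 24 hours.
    print(drift_corrected_time(43200, start_offset=0.0, end_offset=4.0, duration=86400))
```
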

In our experience, subject burden is rated particularly high if sensors
produce discomfort, reduce the range of normal activities (e.g., showering, exercising), 
affect sleep, or are visible (often causing embarrassment in social situations). These issues 
should be taken into account, especially in studies with clinical populations that often are 
already burdened or even stigmatized by their disorder. 

These problems are amplified in relation to assessment duration, which can vary 
from minutes (e.g., exposure to a phobic situation; Wilhelm & Roth, 1998b) to months 
(e.g., the development of circadian rhythm in newborns; So, Adamson, & Horne, 2007). 
Just like the number and type of channels recorded, the appropriate duration depends 
also on the research question and needs to be chosen to obtain enough data for hypoth- 
esis testing, while keeping subject burden, study cost, and analysis load to a minimum. Of 
course, the investigation of rare events (e.g., epileptic seizures or panic attacks) requires 
long assessment intervals to increase the likelihood of actually capturing the phenom- 
enon in question. If, on the other hand, a study is concerned with circadian variations of 
psychological or physiological processes during seasonal changes, at the minimum, 1 day 
during different seasons needs to be examined. Sampling across several days per season, 
then averaging across these days, can increase reliability, particularly for variables
characterized by considerable between-day variation (e.g., mood). Along similar lines,
weekends and weekdays often differ in terms of daily routines and, accordingly, the
situations subjects encounter and the experiences they have. Also, any special characteristics
of the sample under scrutiny should generally be considered because students or certain
clinical groups (e.g., patients with mental disorders) may present unstable lifestyles. For
representative data collection, these effects need to be accounted for by sampling on 1 or
more days from both weekdays and the weekend.


Sampling Mode 


Modern ambulatory recording devices feature large storage capacities; thus, control 
channels and most physiological channels are recorded in a continuous fashion. Excep- 
tions are the collection of endocrine measures (e.g., saliva samples; see Schlotz, Chapter 
11, this volume) and blood pressure monitoring. Self-reports can also be collected only 
intermittently. For more detailed information on the different sampling protocols, please 
see Conner and Lehman (Chapter 5), Kubiak and Krog (Chapter 7), and Moskowitz and 
Sadikaj (Chapter 9) in this volume. At this point, we should mention that the investiga- 
tion of interrelationships of physiological, psychological, situational, and/or behavioral 
factors requires segmentation of continuous data. 

The interpretation of physiological data from real-life studies can be difficult. 
Whereas laboratory research is characterized by very constrained settings and stringent 
control of confounding variables, participants in ambulatory studies are almost always 
unsupervised; they move freely during assessment, and the sequence and duration of 
stimuli that occur in real life are not known in advance. Measurement artifacts and
metabolically related changes can easily be confused with endogenous dysregulations or
alterations induced by psychological processes (e.g., emotions or stress). Thus, a high degree
of context awareness is necessary to resolve interpretational ambiguities of ambulatory
data.

Medical ambulatory monitoring studies and earlier psychological investigations often 
have neglected this important data source. However, the complexity of data streams, and
the multitude of factors that influence them, require a high degree of context awareness
for adequate data interpretation, which can be enhanced by specific channels, items in an 
electronic diary, semistructured protocols, and statistical modeling. Thus, when design- 
ing an ambulatory study, investigators should give much consideration to potential situ- 
ational effects (e.g., certain real-life factors that may trigger the physiological responses 
in question or moderate the physiological functioning) and potential time- or situation- 
dependent confounders of primary dependent variables (e.g., variations in concurrent 
physical activity, metabolic activity, ambient temperature, and time of day). Correspond- 
ing information on situational characteristics, physical activity, social engagement, sub- 
stance intake (e.g., food, caffeine, nicotine and drugs), sleep-wake phases, light expo- 
sure, and so forth can be retrieved by employing additional channels or implementing 
specific items in an electronic diary. 

The most basic type of context information is time of day and date of recordings. 
Time stamping is standard in electronic diaries and ambulatory physiological devices. 
Based on this information, data can be systematically aggregated within and across sub- 
jects to identify individual temporal patterns and perform group comparisons. Event 
marker channels in physiological recorders provide further contextual information at 
a very basic level. They allow subjects (e.g., patients with panic disorder) to register the 
occurrence of predefined events (e.g., onset of panic attacks) simply by pressing a marker 
button. Consequently, the collected data can be segmented meaningfully during analy- 
sis. In a 24-hour study of panic disorder, use of an event marker channel allowed for 
response-locked averaging of HR responses (Margraf et al., 1987). In the Stanford Flight 
Phobia Study, participants were instructed to press a marker button during predefined 
segments of the flight, to fill out a paper-and-pencil diary on anxiety and symptoms, then 
to press the marker button twice. In subsequent analyses of the physiological data, the 
single and dual marker presses could be easily identified as starting and ending points of 
specific flight segments. 
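
The marker logic just described (a single press opens a segment, a double press closes it) can be sketched in Python as follows. The two-second window used to recognize a double press, the function names, and the sample data are assumptions made for this illustration, not details of the original study.

```python
def segments_from_markers(press_times, double_gap=2.0):
    """Convert marker-button presses into (start, end) segments.

    press_times -- times of all button presses in seconds
    double_gap  -- two presses at most this far apart count as one
                   double press (assumed value)
    """
    presses = sorted(press_times)
    events = []  # (time, kind) with kind "single" or "double"
    i = 0
    while i < len(presses):
        if i + 1 < len(presses) and presses[i + 1] - presses[i] <= double_gap:
            events.append((presses[i], "double"))
            i += 2
        else:
            events.append((presses[i], "single"))
            i += 1

    segments, start = [], None
    for time, kind in events:
        if kind == "single":
            start = time  # a repeated single press restarts the segment
        elif start is not None:
            segments.append((start, time))
            start = None
    return segments


def mean_in_segment(samples, segment):
    """Average the values of (time, value) samples that fall inside a segment."""
    start, end = segment
    values = [value for time, value in samples if start <= time <= end]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    marks = [10, 300, 301, 600, 900, 901.5]
    hr = [(0, 70), (100, 90), (200, 94), (700, 110), (800, 114)]
    for segment in segments_from_markers(marks):
        print(segment, mean_in_segment(hr, segment))
```

A double press without a preceding single press is ignored here, which is one simple way to handle a forgotten start marker.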

It should be anticipated that subjects (particularly when encountering challenging 
situations) sometimes forget to press the button. For this case, additional context chan- 
nels can supplement the recording and provide important information on correct assign- 
ment of data. In the Stanford Flight Phobia Study, these included a physical activity chan- 
nel, indicating the beginning and end of boarding, and an air pressure channel, indicating 
the exact moment of take-off, cruising, beginning of descent, and landing. By using the 
redundancy from these additional channels, many subjective marker entries can be veri- 
fied, and correct data analysis can be ensured. 

When it comes to identifying several specific situations in real life, more elaborate 
channels are necessary. Electronic diaries allow implementation of a range of entry menus 
for diverse situational settings in which subjects may find themselves during assessment. 
Categories of situations or activities in real-life studies may, for example, include home 
versus work, physically active versus inactive, sitting versus lying, and wake versus 
sleep. 

Assessing all situational categories whenever they change would be far too demand- 
ing for participants, and lead to unreliable information and lack of compliance. However, 
the problem can be solved in at least two ways: (1) by reducing the number of situational 
categories based on a definition of relevant settings in terms of a priori hypotheses (e.g., 
negative mood is higher at work than at home); and (2) by replacing the e-diary with 
context marker channels that can continuously record a variety of situational categories:
indoors versus outdoors (using a light intensity sensor, or better yet, a sensor that
also records spectral light composition to distinguish indoors vs. outdoors lighting), 
physically active versus inactive (using a wrist-worn accelerometer sensor), alone versus 
accompanied (using the Electronically Activated Recorder [EAR]; see Mehl & Robbins, 
Chapter 10, this volume), eating versus not eating (using the EAR or special microphones 
attached to the head), talking versus quiet (using the EAR, a throat microphone, or res- 
piration pattern sensor; see below), stationary versus mobile (using the EAR), standing 
versus sitting versus lying (using at least two posture sensors, one on the trunk and one 
on the thigh), and awake versus sleep (using an accelerometer or, for detailed analysis, 
polysomnography). 
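
To illustrate how two posture sensors can separate standing, sitting, and lying, the Python sketch below classifies posture from the inclination of the trunk and the thigh. It assumes that each sensor reports static acceleration along three axes, that the first axis is aligned with the long axis of the body segment, and that 45 degrees is a usable cutoff; the function names, the axis convention, and the cutoff are all hypothetical.

```python
import math


def inclination_deg(along_segment, axis2, axis3):
    """Angle in degrees between the segment's long axis and the vertical.

    0 means the segment is vertical, 90 means it is horizontal. Assumes the
    sensor is at rest, so that it measures gravity only.
    """
    magnitude = math.sqrt(along_segment ** 2 + axis2 ** 2 + axis3 ** 2)
    if magnitude == 0:
        raise ValueError("acceleration vector has zero length")
    cosine = max(-1.0, min(1.0, along_segment / magnitude))
    return math.degrees(math.acos(cosine))


def classify_posture(trunk_angle_deg, thigh_angle_deg, upright_limit=45.0):
    """Classify posture from trunk and thigh inclination (assumed cutoff)."""
    if trunk_angle_deg >= upright_limit:
        return "lying"      # trunk closer to horizontal than to vertical
    if thigh_angle_deg < upright_limit:
        return "standing"   # trunk and thigh both upright
    return "sitting"        # trunk upright, thigh closer to horizontal


if __name__ == "__main__":
    trunk = inclination_deg(0.98, 0.10, 0.05)
    thigh = inclination_deg(0.10, 0.05, 0.99)
    print(classify_posture(trunk, thigh))
```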

Inferring complex facets of a person’s psychological experience solely from the physiological
data obtained in real-life assessment is not possible. The relationship between mind and
body may be particularly distorted or misleading in specific situations due to the fact that 
physiological systems primarily serve metabolic and homeostatic functions, such as motor 
activity, energy expenditure, and circadian rhythms. Thus, time- or situation-dependent 
confounders can substantially affect physiological DVs and present a major obstacle to 
straightforward interpretation. Emotion or stress effects in psychophysiological variables 
(e.g., HR or skin conductance) can easily be masked by environmental factors contribut- 
ing to circadian and quasi-random variation in physical activity, social interaction, and 
substance intake. 

Confounding by physical activity is a major problem in the unconstrained settings of
ambulatory assessment (Wilhelm, Pfaltz, Grossman, & Roth, 2006). Thus, clinically relevant
ambulatory monitoring systems capable of registering motor activity need to be employed 
to obtain an evaluative context. Otherwise, ordinary exercise-induced physiological 
changes run the risk of being confused with stress, emotion, or disease indicators. For 
example, abnormal ECG values may be due to physical exercise rather than representing 
some kind of dysregulation. Employing accelerometry (also good for posture detection) 
or electromyography (e.g., upper leg) to identify motorically active periods can solve this 
ambiguity. In the statistical analysis, one can then adjust for motor activity by using it as 
a covariate or by determining the “additional HR” (see below). 

Of course, HR is functionally linked to physical activity (oxygen consumption of 
the musculature). However, because it is also highly responsive to stress and anxiety, 
HR very often is used as an index of emotional arousal. Under the well-controlled condi- 
tions of laboratory research or structured protocols, HR proved to be a sensitive index of 
anxiety (Wilhelm & Roth, 1998c). During real-life assessment, in contrast, HR is highly 
correlated with motility (Grossman et al., 2004); hence, it is not a reliable measure of 
anxiety or stress when the level of concurrent physical activity is unknown. The psycho- 
logical component is relatively small compared to the substantial metabolic component, 
and can easily be masked by it. “Unmasking” the additional (nonmetabolic) HR effects is 
feasible when physical activity is recorded reliably, and with sufficient time and intensity 
resolution. Use of a special procedure of calibration (stepwise increase in physical work- 
load) and autoregression allows partialing out the physical activity component for each 
individual. This provides a continuous measure of emotional activation (e.g., anxiety or 
stress) that is relatively free of physical activity confounds (see Figure 12.1). For further 
data-analytic methods, see Part III in this handbook. 
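
The idea of additional HR can be sketched in a deliberately simplified form: fit, for one individual, a straight line that predicts HR from physical activity during the calibration procedure, then subtract the predicted (metabolic) HR from the HR observed in daily life. The Python sketch below uses ordinary least squares only and omits the autoregressive component mentioned above; the function names and the numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("activity values must not all be identical")
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return slope, mean_y - slope * mean_x


def additional_hr(hr, activity, slope, intercept):
    """Observed HR minus the HR predicted from concurrent physical activity."""
    return [h - (intercept + slope * a) for h, a in zip(hr, activity)]


if __name__ == "__main__":
    # Hypothetical calibration: stepwise workload with HR at each step.
    slope, intercept = fit_line([0, 1, 2, 3], [60, 80, 100, 120])
    # Hypothetical daily-life samples: HR with concurrent activity level.
    print(additional_hr([85, 130], [1, 3], slope, intercept))
```

A linear activity-HR relation is an assumption of this sketch; the relation may be nonlinear at high workloads, and the calibration range should cover the activity levels expected in daily life.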

Social interaction is an important context variable in psychophysiological studies, 
because it modulates emotions. Social exchanges are typically characterized by verbal
conversations; thus, speech activity can serve as an indicator of social engagement. The
frequency and distribution of speech behavior across the day might be an important 
situational and behavioral variable in psychophysiological research concerned, for example,
with mood and emotion, social isolation, depression, or drug effects. However,
speech activity may also be a confounding factor in the assessment of physiological emotion
channels, so that ordinary, speech-induced physiological changes are mistaken for
emotional reactions. Whereas HR increases by 5-10 bpm during conversational (unemotional)
speaking, emotionally challenging situations can raise HR much more (Linden,
1987). Thus, depending on the research question, speech periods during real-life assess- 
ment should be identified. This can be realized by employing an external microphone 
(which bears the problem that it may also record external noise or speech from others), 
or a throat microphone (which is rather uncomfortable and attracts attention), or by 
inferring speech periods from the respiratory signal of the person examined. Speaking 
is accompanied by changes in respiration, and the ratio of inspiratory time/total time of 
breaths is a good indicator of speech (Wilhelm, Handke, & Roth, 2003). 
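
The respiratory speech indicator can be illustrated with a short Python sketch that computes the ratio of inspiratory time to total breath time for each breath and flags breaths with a low ratio. It assumes that speech shortens inspiration relative to expiration, so that the ratio drops during speaking; the cutoff of 0.25, the function names, and the example breaths are hypothetical and would need to be validated for a given recording setup.

```python
def inspiratory_fraction(t_insp, t_exp):
    """Ratio of inspiratory time to total breath time (both in seconds)."""
    total = t_insp + t_exp
    if t_insp < 0 or t_exp < 0 or total <= 0:
        raise ValueError("breath durations must be non-negative and not both zero")
    return t_insp / total


def flag_speech(breaths, threshold=0.25):
    """Flag each (t_insp, t_exp) breath whose ratio falls below the cutoff.

    The cutoff is an assumed value, not one reported in the chapter.
    """
    return [inspiratory_fraction(ti, te) < threshold for ti, te in breaths]


if __name__ == "__main__":
    # Hypothetical breaths: one quiet breath, one with a short inspiration
    # followed by a long expiration, as would be expected during speaking.
    print(flag_speech([(1.6, 2.4), (0.5, 3.5)]))
```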

Ingestion of various substances, such as food, caffeine, nicotine, alcohol, medication,
and drugs, also clearly alters the physiological landscape. For example, during and
after a high-carbohydrate meal, HR and blood pressure often increase for several hours 
(Fagan, Conrad, Mar, & Nelson, 1986). Caffeine and nicotine have sympathetic effects 
on the cardiovascular and electrodermal systems. Medications prescribed for allergies 
or for cardiovascular, respiratory, or mental disorders often have sympathetic and/or 
parasympathetic effects. Electronic diary entries of time, type, and amount of ingestion 
are essential to examine effects on emotion, cognition, and experience, or to control for 
their influence on DVs. 

Diurnal endogenous influences from the circadian pacemaker (the suprachiasmatic 
nucleus) are especially related to physiological regulation patterns, including the tempera- 
ture and cardiovascular systems (van Eekelen, Houtveen, & Kerkhof, 2004). They may 
also influence a number of psychological variables, such as mood, alertness, 
and memory (Boivin et al., 1997; Dijk, Duffy, & Czeisler, 1992). Since circadian rhythms 
change slowly over a period of about 24 hours, their confounding effects can, at least to 
some degree, be prevented by adjusting for this slowly shifting “baseline” in the analysis. 
Segmenting by time of day and wake versus sleep is essential in accounting for circadian 
influences (Grossman, Deuring, Garland, Campbell, & Carlson, 2008). The correspond- 
ing information can be retrieved from electronic diary entries, accelerometer measure- 
ment, or polysomnography. 
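
One simple way of adjusting for the slowly shifting baseline is to express each sample relative to the mean of its own time-of-day segment. The following sketch does so under stated assumptions: the segment boundaries (12:00 and 18:00) and all names are hypothetical, and the wake/sleep flag is taken as given (e.g., from diary entries or accelerometry).

```python
from collections import defaultdict
from statistics import mean


def segment_label(hour, asleep):
    """Assign a sample to a coarse segment (boundaries are illustrative)."""
    if asleep:
        return "sleep"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"


def center_within_segments(samples):
    """samples: list of (hour, asleep, value); returns value minus segment mean."""
    by_segment = defaultdict(list)
    for hour, asleep, value in samples:
        by_segment[segment_label(hour, asleep)].append(value)
    baselines = {label: mean(values) for label, values in by_segment.items()}
    return [value - baselines[segment_label(hour, asleep)]
            for hour, asleep, value in samples]


# Invented heart rate samples: (hour of day, asleep?, beats per minute).
samples = [(3, True, 52), (4, True, 56), (9, False, 70), (10, False, 78), (20, False, 66)]
centered = center_within_segments(samples)
```

Segment-wise centering is a coarse substitute for modeling the circadian rhythm explicitly; it removes level differences between segments but not trends within them.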

The development of ambulatory sensors and algorithms capable of detecting and 
categorizing specific episodes (e.g., emotion) in real life, based on physiological changes 
and certain context variables (Picard, 1997), is a distant objective. To date, research on 
pattern recognition procedures for psychophysiological delineation of organismic states 
is still in its infancy and occurs exclusively in laboratory settings. 


Conclusion 

Psychophysiological methods provide a window for understanding human behavior and 
organismic functioning among both healthy people and those with a variety of physical, 
mental, and psychosomatic disorders. Combining subjective measures from self-reports 
or interviews with objective physiological data, psychophysiological research accommodates 
a holistic concept of human functioning and can help in development of a sound 
understanding of mind-body connections and interactions. Much of our current scien- 
tific knowledge derives from laboratory research; however, as we have elucidated, such 
findings are often afflicted with problems. It is becoming increasingly obvious that we 
need ambulatory approaches to overcome the multiple laboratory confounds described, 
and to address essential and often ignored issues in psychological research. 

We have presented a wide range of arguments in support of increased and integra- 
tive use of ambulatory strategies. Today, both hardware and software for obtaining and 
analyzing physiological real-life data are available: Ambulatory recorders have become 
remarkably powerful and inexpensive. Also, analysis strategies and sophisticated soft- 
ware for processing the large amount of complex data have evolved in the past years and 
will continue to do so. Yet publications based on laboratory research outnumber those 
based on ambulatory observation and experimentation by several orders of magnitude. 
A future challenge in psychophysiological research will therefore be to begin to create a 
reasonable balance between the approaches. 

We have indicated that psychophysiological ambulatory assessment, in particular, 
poses a number of difficulties. Nevertheless, considering the multitude of design options 
carefully and making clever choices with regard to the respective research question can 
facilitate successful ambulatory research. We hope that more and more young scientists, 
as well as experienced laboratory researchers, will expand their research methods to real 
life to close an important gap and connect laboratory findings to what people experience 
in their daily lives. 


CHAPTER 13 


Ambulatory Assessment 
of Movement Behavior 


Methodology, Measurement, and Application 


JOHANNES B. J. BUSSMANN 
ULRICH W. EBNER-PRIEMER 


Behavior is central to psychology in almost any definition of the field. A specific subdomain 
of behavior is movement behavior, the overt performance of postures, movements, 
and activities in daily life, and its covert physiological consequences. Although 
overt behavior is a core aspect of psychology, assessment strategies have tended to focus 
on emotional, cognitive, or physiological responses. However, there is converging evidence 
that movement behavior is a unique and relevant domain of interest that cannot accu- 
rately be assessed by self-reports, questionnaires, or other instruments. The methodology 
of ambulatory assessment of movement behavior has benefited from a number of major 
developments over the last 15 years: New movement sensors and advanced computer 
algorithms allow precise, long-term, and user-friendly assessment of movement behav- 
ior, such as the amount of physical activity, momentary posture, basic types of motion, 
as well as movement pathologies in everyday life. Movement behavior thus includes not 
only quantitative aspects such as the amount and intensity of postures, movements, and 
activities but also qualitative aspects related to “how people move,” such as smoothness 
of movement, speed, and movement patterns. 

With this overview, we aim to introduce recent developments in ambulatory activity 
monitoring to a broad readership in psychology. We (1) report on some conceptual issues 
around the assessment of movement behavior; (2) provide a summary of techniques and 
instruments used to assess activity, postures, and movement; (3) discuss methodological 
aspects of measurement; and (4) demonstrate recent applications of ambulatory assess- 
ment of physical activity, posture, and movement within a wide range of psychological 
topics. 


Conceptual Issues of Actual Movement Behavior 


Several issues play a role in deciding on whether to measure actual movement behavior. 
In this section we want to focus on some important points of discussion: the difference 
between performance and capacity, the influence of measurement setting, and the differ- 
ence between self-reported or experienced behavior and actual behavior. 


Performance versus Capacity and Daily Life versus Laboratory 


In medicine, and especially in physical medicine and rehabilitation (PM&R), the International 
Classification of Functioning, Disability and Health (ICF), which focuses on 
the functional consequences of diseases, is a leading model. One of the distinctions made 
within that model is the difference between the qualifiers performance and capacity. 
Performance is about what people do in their daily life (which is sometimes called “do 
do”), and capacity is about what people can do in an optimal environment (which is 
called “can do”). The difference between these qualifiers has its origin in the recognized 
fact that what people can do is—or can be—different from what they actually do: People 
often do not do what they can, as a result of personal or environmental factors. Literature 
and our own research have shown that relationships between capacity and performance 
parameters are generally weak or absent. There are some indications that this relation- 
ship is stronger among people with more severe disabilities (Gijbels et al., 2010; Hore- 
mans, Bussmann, Beelen, Stam, & Nollet, 2005). However, results generally indicate that 
behavior cannot be predicted from capacity tests, and that direct measurement of “do 
do” is necessary to obtain insight into it. 

A laboratory-based measurement can be done to assess capacity but sometimes is 
aimed at measuring performance. For example, in PM&R patients are often measured in 
a movement laboratory to assess their walking pattern. This is accomplished by simple 
timed tests, but frequently complex instruments (e.g., camera systems, force plates) are 
used. An important issue, then, is whether laboratory behavior is representative: Labo- 
ratory measurement is performed under the assumption that the behavior manifested 
and measured under laboratory conditions is representative of the behavior performed 
in daily life outside the laboratory. Doubt may be cast on this assumption in many cases, 
and this in itself is a decisive reason for conducting in-field measurements. For example, 
the artificial setting and the fact that a person is well aware of being “observed” may 
induce behavior and physiological processes that are different from those that would 
otherwise occur in normal daily life (see Reis, Chapter 1, this volume). For example, in 
a study of postpolio patients, Horemans and colleagues (2005) found that the heart rate 
while walking at a self-paced speed was significantly lower in a laboratory than in daily 
life; the same was not true for step rate (Figure 13.1). This difference is likely due to more 
demanding environments in daily life (e.g., dual- or multitasking, uneven ground). This 
and similar studies seriously question the generalizability of laboratory-based measure- 
ments. 


Questionnaires versus Activity Monitoring 


FIGURE 13.1. Heart rate and step rate in a standardized laboratory environment, and in daily 
life conditions. Based on Horemans, Bussmann, Beelen, Stam, and Nollet (2005). 


Even though objective assessment and subjective ratings of physical activity (PA) aim to 
measure the same construct, studies and systematic reviews do show that the two methods 
are not closely related. In a systematic review, Prince and colleagues (2008) compared 
subjectively (self-report; e.g., questionnaire, diary) and objectively (directly measured; 
mostly accelerometry) assessed PA in adults. Comparing 187 studies, the authors found 
low-to-moderate correlations between self-report and direct measures. Figure 13.2 rep- 
resents correlation coefficients between direct PA and self-report measures of PA in 148 
studies that reported such a coefficient. Because the same construct (PA) is measured by 
two different assessment methods, we would expect high correlations (values above .8). 
Overall, correlations were low to moderate, with a mean of .37 (SD = 0.25; range -.71 
to .98). These findings suggest that the measurement method has a significant impact on 
the data obtained. Compared to PA directly measured by accelerometers, self-report measures 
of PA were generally higher (studies with males and females: 44% difference; studies with 
males only: 44% difference; studies with females only: 138% difference). 

In a systematic review of PA in children, Adamo, Prince, Tricco, Connor Gorber, 
and Tremblay (2009) compared 83 studies using indirect versus direct measures. Among 
studies that reported combined male and female data, the average difference between 
accelerometer and indirect measures was 147%, with higher scores for indirect measures. 
As in reviews on adults (Prince et al., 2008), indirect measures overestimated PA to a 
greater extent in females (584%) than in males (114%). The majority of studies that com- 
puted correlations between indirect and direct measures found low-to-moderate associa- 
tions (-.56 to .89), albeit some were found to be significant. The studies providing Bland- 
Altman plots (a method of data plotting to analyze the degree of agreement between two 
measures of the same construct) showed poor individual agreement between measures. 
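
The two agreement statistics mentioned above can be illustrated with a short sketch. The data are invented, the names are hypothetical, and the percent difference is computed on group means, which is an assumption and not necessarily the formula used in the reviews. The limits of agreement follow the usual Bland-Altman convention of the mean difference plus or minus 1.96 standard deviations of the differences.

```python
from statistics import mean, stdev


def bland_altman(indirect, direct):
    """Return (bias, lower limit, upper limit) of agreement for paired data."""
    if len(indirect) != len(direct) or len(indirect) < 2:
        raise ValueError("need two paired series with at least two values")
    differences = [a - b for a, b in zip(indirect, direct)]
    bias = mean(differences)
    half_width = 1.96 * stdev(differences)
    return bias, bias - half_width, bias + half_width


def percent_difference(indirect, direct):
    """Percent by which the indirect group mean exceeds the direct group mean."""
    return 100.0 * (mean(indirect) - mean(direct)) / mean(direct)


# Invented minutes of PA per day for five participants.
self_report = [60, 45, 90, 30, 75]
accelerometer = [40, 35, 50, 25, 50]

bias, lower, upper = bland_altman(self_report, accelerometer)
overestimate = percent_difference(self_report, accelerometer)
```

A wide interval between the lower and upper limits indicates poor individual agreement even when the two measures are correlated.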

FIGURE 13.2. Scatterplot of all correlation coefficients (random-like ordered, x-axis) between 
direct measures and self-report measures in 148 studies. From Prince et al. (2008). Reprinted with 
permission from BioMed Central. 


All in all, both systematic reviews (Adamo et al., 2009; Prince et al., 2008) showed 
substantial discrepancies and only moderate correlations between indirect methods and 
direct measures of assessing PA. Consequently, the authors questioned the widespread 
approach of justifying the use of the more cost-effective and less invasive indirect methods 
by positing meaningful correlations between indirect and direct measures. Systematic 
reviews simply do not support such a “quick and dirty” approach. Rather, both reviews 
support our general statement that actual PA cannot be accurately assessed by subjective 
self-reports. 

The reasons for the discrepancies between objective and subjective assessments of 
PA remain unclear. One possibility is that the memory of PA is retrospectively distorted, 
just as in other subjective self-reports (e.g., Ebner-Priemer et al., 2006; Ebner-Priemer & 
Trull, 2009; see also Schwarz, Chapter 2, this volume, for further common distortions). 
In fact, low correlations were also found in studies assessing subjective reports of PA with 
diaries, that is, in near real time (see Fahrenberg, 1996). Individuals might also be unac- 
customed to estimating the amount of their own PA, and their ability to do so may not be 
well developed. However, objective measures certainly also have weaknesses and cannot 
uncritically be taken as the “gold standard.” 

But do the reported discrepancies matter beyond academic discussion? Reilly and 
colleagues (2008) reported a convincing example. National surveillance of pediatric PA 
in the United Kingdom relies on subjective (parental) reports of PA, and survey findings 
showed that public health exercise-related targets were being exceeded by over 75% (e.g., 
in the Scottish Executive, 2005). Comparable U.K. studies that measured PA by accelerometry 
suggest, however, that less than 5% of children and adolescents meet the target of 
60 minutes of moderate-to-vigorous physical activity per day. These discrepancies are of 
major importance for national health policies. Two other convincing cases for the added 
value of ambulatory activity monitoring come from Buchman, Wilson, and Bennett 
(2008) and Barnes and colleagues (2008), who investigated whether daily activity is asso- 
ciated with cognitive functioning in older persons. Buchman and colleagues (2008) used 
structured evaluation of cognition and objective measures of total daily PA by actigraphy 
in 521 older persons without dementia. Data revealed a positive association between 
total daily activity and a broad range of cognitive abilities. Interestingly, self-reported PA 
was not associated with cognition. Similarly, Barnes and colleagues investigated 2,736 
older women in their 80s, without evidence of dementia, and found that women in the 
highest movement quartiles had better mean cognitive test scores than those in the low- 
est quartile. Again, associations were independent of self-reported walking. Overall, the 
hypothesized correlation between PA and cognitive functioning in older adults emerged 
only with objective measures of PA. 

In summary, poor concordance between ambulatory activity monitoring and sub- 
jective self-reports of activity has been demonstrated on the basis of within-subject cor- 
relations using real-time data capture to assess self-reports, as well as retrospectively 
recalled subjective ratings. While questionnaires are undoubtedly valuable as a method 
for studying subjective (mental) representations of experience, attitudes, and behavior, 
self-assessment of this kind cannot serve as a substitute for the collection of actual behav- 
ioral data in everyday life (Baumeister, Vohs, & Funder, 2007; Fahrenberg, Myrtek, 
Pawlik, & Perrez, 2007). Results from studies relying solely on self-reported PA should 
be interpreted with caution (Rapport, Kofler, & Himmerich, 2006; de Vries, Bakker, 
Hopman-Rock, Hirasing, & van Mechelen, 2006; Ward, Evenson, Vaughn, Rodgers, & 
Troiano, 2005). Although objective measures have their limitations as well, they offer 
many advantages and unique possibilities. In the next section we discuss some ambula- 
tory activity monitoring categories in more detail. 


Techniques and Instruments 


Ambulatory monitoring of posture and motion means using devices to measure and record 
activity, posture, and motion objectively and continuously in a natural environment. 
There has been no consensus about either the terminology or the taxonomy of ambula- 
tory activity monitoring instruments. However, in this chapter we distinguish three cat- 
egories of devices to help researchers new to this field to develop a basic understanding. 
The categories are based on their output: (1) activity counters, (2) actometers (or acti- 
graphs), and (3) multichannel ambulatory movement sensor devices. 

The instruments may differ widely between and within these device categories, for 
example, regarding the type and number of sensor locations (arm, leg, waist, multiloca- 
tion), the number of measurement axes (one, two, or three dimensions), the direction 
of sensitive axes, data storage (sensor and storage in one unit, one or more sensors con- 
nected to a data logger), data transmission (connected to PC, Internet, mobile phones), 
data processing and analysis (real-time, postmeasurement, and fuzzy logic neural net- 
works; “learning” systems), outcome measures, and the possibility to include additional 
signals (e.g., heart rate, electromyogram, and electrodermal activity). For a recent review 
of PA measurement, see Warren and colleagues (2010). For an overview of available 
devices (links/hardware), see www.ambulatory-assessment.org. 


Activity Counters 


A so-called activity counter is based on counting suprathreshold movements using 
mechanical sensors or signals from miniature movement sensors (Corder, Brage, & 
Ekelund, 2007; de Vries et al., 2006; Rapport et al., 2006; Zheng, Black, & Harris, 
2005). In this way the number of movements (movement counters) and/or steps (step 
counters, pedometers) is measured. Once the threshold has been passed, the count is inde- 
pendent of the movement intensity. Nowadays, most devices to measure general activity 
or movement also include movement intensity (see next section), but the described coun- 
ter technology is still common in step counters. The validity of step counters is a major 
issue, because steps and step detection vary highly within and between subjects, because 
of the influence of walking speed, walking surface, walking aids, and walking pattern. 
Despite these limitations, many studies still use activity counter technology, since 
devices generally are inexpensive, simple, and therefore practical for large studies. 
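
The counting principle described above can be sketched in a few lines. The function name, the threshold, and the signal values are assumptions for illustration; actual devices implement the principle in hardware or firmware that is not documented here.

```python
def count_suprathreshold_events(signal, threshold):
    """Count upward crossings of the threshold; amplitude beyond it is ignored."""
    count = 0
    above = False
    for sample in signal:
        if sample > threshold and not above:
            count += 1  # a new movement starts when the threshold is passed
        above = sample > threshold
    return count


# Invented movement signal (arbitrary units) with three suprathreshold events.
signal = [0.1, 0.6, 0.7, 0.2, 1.5, 0.1, 0.55, 0.1]
events = count_suprathreshold_events(signal, threshold=0.5)
```

Note that the strong movement (1.5) and the weak one (0.55) each add exactly one count, which is why the count is independent of movement intensity.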


Actometers 


Actometers or actigraphs (Figure 13.3) are sometimes referred to as full proportional 
actigraphs, because they not only measure whether a person is active, but also provide 
data that are indeed fully proportional to the intensity of PA. This methodology has 
experienced a key breakthrough, thanks to the development and availability of small and 
energy-efficient miniature accelerometers. These devices measure acceleration, which is 
change in velocity over time, and can describe intensity, rate of occurrence, and duration 
of PA (Corder et al., 2007). For a review on uniaxial accelerative devices, see Chen and 
Bassett (2005) and de Vries and colleagues (2006), and for research recommendations, consult 
Chen and Bassett (2005), de Vries and colleagues (2006), Trost, McIver, and Pate (2005), and 
Ward and colleagues (2005). 
Actometers are generally one-unit devices, mostly matchbox size, containing battery, sen- 
sors, signal processing unit, and storage. Actometers are mostly worn around the waist or 
wrist. Usually, data are integrated and converted before being stored as an activity score 
per freely definable time interval. In most cases, raw data cannot be stored or can be stored 
only for short periods. Data can be transferred to a PC for additional data analysis. Some 
recorders can also record an additional signal (e.g., temperature, light, or sound). Actometers 
are extensively used and well validated for determining sleep-wake cycles. Most 
actometers include an algorithm with which acceleration output is converted to energy 
expenditure. Two limitations of actometers are that the intensity or energy expenditure 
of some activities is overestimated, whereas that of other activities is underestimated (Chen 
& Bassett, 2005; de Vries et al., 2006), and that they cannot differentiate between different 
postures or movements. 
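
The integration of acceleration into one activity score per time interval might look as follows. This is a sketch under assumptions: the input is a single-axis acceleration signal from which the gravitational component has already been removed, the score is the rectified signal summed over the epoch and scaled by the sampling interval, and the names and values are invented. Manufacturers use their own, often proprietary, conversions.

```python
def activity_scores(acceleration, sampling_hz, epoch_s):
    """Integrate rectified acceleration over consecutive complete epochs."""
    samples_per_epoch = sampling_hz * epoch_s
    scores = []
    last_start = len(acceleration) - samples_per_epoch
    for start in range(0, last_start + 1, samples_per_epoch):
        epoch = acceleration[start:start + samples_per_epoch]
        # Rectify, sum, and scale by the sampling interval (1 / sampling_hz).
        scores.append(sum(abs(x) for x in epoch) / sampling_hz)
    return scores


# Invented signal at 4 Hz: a calm 2-s epoch, an active 2-s epoch, and a
# trailing incomplete epoch that is dropped.
acceleration = [0.1] * 8 + [0.5, -0.5] * 4 + [0.3] * 3
scores = activity_scores(acceleration, sampling_hz=4, epoch_s=2)
```

Because only the integrated score is kept, the raw waveform cannot be reconstructed afterward, which is one reason such scores cannot distinguish postures or types of movement.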


Multichannel Ambulatory Devices 
HARDWARE 


Multichannel ambulatory movement sensor devices (Figure 13.3) record signals contin- 
uously. The signals are processed and converted into indices of activity, posture, and 
motion patterns. They can not only detect different types of postures (e.g., sitting, standing, 
lying) and movements (e.g., walking, cycling, driving a wheelchair) but also quantitatively 
represent qualitative features of the postures and movements carried out (balance, 
symmetry, speed, etc.). 

FIGURE 13.3. Examples of an actometer: (a) the GT3X Actigraph and multichannel ambulatory 
devices, (b) the Vitaport Activity Monitor, and (c) its successor, the VitaMove system. 


Capturing posture and motion has substantially progressed over the last two decades 
due to several major innovations. First of all, the development of new, small, and fea- 
sible sensors (e.g., piezoresistive accelerometers) is noteworthy. They react to both gravi- 
tational acceleration (giving inclination information) and inertial accelerations due to 
movement. This characteristic is important for differentiating postures and movements 
(for details, see Chen & Bassett, 2005). Another innovation is the development of wire- 
less data transfer and pocket-size digital data recorders (Ebner-Priemer & Kubiak, 2007; 
Reilly et al., 2008), with growing data processing and storage capacity (Figure 13.3). 
Higher computer capacity also enables use of advanced methods of postmeasurement 
signal analysis. Software has been created for automatic classification of motion pat- 
terns in multichannel recordings. Most multichannel ambulatory accelerometry devices 
can use all types of sensors, which can be applied to different body locations (arm, leg, 
or waist). Sensors are connected to a data logger (mostly wired, but wireless devices are 
increasingly appearing now), where data are stored or programmable software performs 
real-time analysis. Some of these devices link PA with other physiological signals, such as 
respiration rate, ECG (electrocardiography), or EMG (electromyography). 
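
How the gravitational component yields inclination information can be shown with a small calculation. The sketch assumes a three-axis sensor at rest, so that the measured vector consists of gravity only, with readings in units of g; the function name and the choice of the z axis as reference are hypothetical.

```python
import math


def inclination_deg(ax, ay, az):
    """Angle in degrees between the sensor's z axis and the vertical."""
    magnitude = math.sqrt(ax * ax + ay * ay + az * az)
    if magnitude == 0:
        raise ValueError("acceleration vector must be nonzero")
    # Clamp to the valid domain of acos to guard against rounding error.
    cosine = max(-1.0, min(1.0, az / magnitude))
    return math.degrees(math.acos(cosine))


upright = inclination_deg(0.0, 0.0, 1.0)      # z axis aligned with gravity
horizontal = inclination_deg(1.0, 0.0, 0.0)   # z axis perpendicular to gravity
```

During movement, inertial accelerations are superimposed on gravity, so the angle is meaningful only for low-pass filtered or stationary segments.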
ALGORITHMS FOR DETECTION OF POSTURE AND MOTION PATTERNS 


The validity of the activity, posture, and movement data is partly determined by the qual- 
ity of the device (e.g., sensors, cables, data transmission, data recorder). At least equally 
important, though, is the way data are processed after the measurement. Preece, Goul- 
ermas, Kenney, and Howard (2008) provide a review of different algorithms. Bussmann, 
Ebner-Priemer, and Fahrenberg (2009) discuss two approaches in more detail: fixed- 
threshold classification (based on preset and activity-specific feature settings) and 
reference pattern-based classification (using individual reference patterns for each posture 
and activity). 
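
The difference between the two approaches can be made concrete with a toy example. Everything in it is an assumption chosen for illustration: the three features (trunk angle, thigh angle, and a motility score), the cutoffs, the reference values, and the city-block distance. It does not reproduce the algorithms of any of the systems named in this chapter.

```python
def classify_fixed_threshold(trunk_deg, thigh_deg, motility):
    """Apply preset cutoffs in a fixed order (all cutoffs are illustrative)."""
    if motility > 0.3:
        return "moving"
    if trunk_deg > 60:
        return "lying"
    if thigh_deg > 45:
        return "sitting"
    return "standing"


def classify_by_reference(features, references):
    """Return the label of the individual reference pattern closest to features."""
    def distance(label):
        # City-block distance; mixing units this way is a simplification.
        return sum(abs(f - r) for f, r in zip(features, references[label]))
    return min(references, key=distance)


# Hypothetical reference patterns recorded for one participant:
# (trunk angle in degrees, thigh angle in degrees, motility score).
references = {
    "lying": (85.0, 85.0, 0.02),
    "sitting": (10.0, 80.0, 0.03),
    "standing": (5.0, 5.0, 0.03),
    "walking": (8.0, 20.0, 0.60),
}

segment = (12.0, 70.0, 0.05)
fixed_label = classify_fixed_threshold(*segment)
reference_label = classify_by_reference(segment, references)
```

The fixed-threshold variant needs no individual calibration, whereas the reference-based variant requires each participant to perform a short set of standard postures and activities before monitoring starts.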

Examples of the first category are the Vitaport Activity Monitor (TEMEC Instru- 
ments; Bussmann et al., 2001), the DynaPort System (McRoberts), the Intelligent Device 
for Energy Expenditure and Physical Activity (IDEEA®; MiniSun LLC, Fresno, Califor- 
nia; Zhang, Werner, Sun, Pi-Sunyer, & Boozer, 2003), and the LifeShirt® system. The 
Freiburg Monitoring System (FMS; Myrtek, Foerster, & Bruegner, 2001), based on the 
VarioPort System (Becker MediTec), employs reference pattern-based classification. 
Recent software systems generally permit automatic classification of common activity 
patterns in 24-hour or 48-hour recordings within a few minutes and without interactive 
editing (Bussmann et al., 2001; Myrtek et al., 2001). The software output for every seg- 
ment, such as each minute, includes the most prominent activity pattern (e.g., lying or 
sitting), as well as an activity intensity score. Strategies for determining the most prominent 
activity pattern differ by system. 


Methodological Issues 


Current devices and signal analysis software offer many possibilities for the objective 
assessment of movement behavior in psychology and psychophysiology. However, some 
limitations and critical issues need consideration. Although some of them are general (i.e., 
they also apply to other ambulatory assessment methods) and are discussed elsewhere in 
this volume, we discuss them here in the specific context of movement behavior. 


Sampling Issues and the Time-Based Design 


The time-based design is a fundamental part of ambulatory activity monitoring. It is 
defined by the number of assessment points, the intervals between the assessment points, 
the time interval over which data are integrated, and the total length of the assessment 
period. The number of assessment points and the time between them are not a critical 
aspect, as the activity signal can be sampled and saved at very high frequencies. Activ- 
ity counters and actometers integrate data over fixed intervals, which may range from 
seconds to several minutes. When data are not averaged, sampling and saving raw data 
at, minimally, 32 Hz are standard. The key issue in the time-based design is the length 
of the total assessment period. Although battery capacity and participant burden reduce 
the possible length of the total assessment, longer periods are favored because variability 
between days can be high, especially between weekdays and the weekend (Rowlands, 
Pilgrim, & Eston, 2008). Naturally, actometers are advantageous in this regard because 
burden and battery capacity are of less importance compared to multichannel ambulatory 
devices. Individual suggestions regarding a minimum total assessment period vary: 
10 days or 2 weeks: Tryon (2006); three 24-hour periods: Littner and colleagues (2003); 
3-7 days: Reilly and colleagues (2008). International guidelines have not yet been cre- 
ated. 
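
The practical consequence of storing raw data over long assessment periods can be estimated with simple arithmetic. The 32 Hz sampling rate is taken from the text above; the three measurement axes and two bytes per sample are assumptions made for the example, as are the function names.

```python
def raw_samples_per_day(sampling_hz=32, axes=3):
    """Number of raw samples recorded in 24 hours."""
    seconds_per_day = 24 * 60 * 60
    return sampling_hz * axes * seconds_per_day


def storage_megabytes(days, bytes_per_sample=2, sampling_hz=32, axes=3):
    """Approximate raw storage need in megabytes (1 MB = 1,000,000 bytes)."""
    total_bytes = days * raw_samples_per_day(sampling_hz, axes) * bytes_per_sample
    return total_bytes / 1_000_000


per_day = raw_samples_per_day()   # 32 * 3 * 86,400 samples
one_week = storage_megabytes(7)   # raw storage for a 7-day recording
```

Under these assumptions a week of raw data amounts to roughly 116 MB, whereas a device storing one integrated score per minute needs only 10,080 values for the same period.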


Reactivity and Compliance 


The issue of measurement reactivity with intensive longitudinal data is discussed in more 
detail elsewhere (see Barta, Tennen, & Litt, Chapter 6, this volume). From the viewpoint 
of ambulatory activity monitoring, the outcomes of studies evaluating comfort and the 
effect of measurement on the data of interest have, overall, been satisfactory. For example, 
Ebner-Priemer and colleagues (2007) interviewed participants after a 24-hour ambula- 
tory activity monitoring session (e.g., whether they experienced higher attention to bodily 
sensations, whether they found the device to be unpleasant). Responses showed that par- 
ticipants were not overly affected by the monitoring itself. Ratings for distress, as well as 
reactivity caused by the device, were minimal. 

However, an example of reactivity to the experience of being monitored was given 
by Costa, Cropley, Griffith, and Steptoe (1999), who observed the impact of participat- 
ing in 24-hour ambulatory blood pressure (BP) monitoring. Physical activity was assessed 
by use of triaxial accelerometers on the day of the BP monitoring and on another day 
without BP monitoring. PA was significantly lower during the BP-monitoring day. This is 
likely in part a result of periods of immobility during BP measurement and diary comple- 
tion, and in part due to more general, self-imposed restrictions on movement and activity. 
Fortunately, newer devices are lighter and much smaller, and do not have to be taken off 
during sport activities. Clemes, Matchett, and Wane (2008) investigated the influence 
of wearing a pedometer and recording daily step counts on ambulatory activity. Partici- 
pants, blinded to the study aim, wore a sealed pedometer (that provided no feedback on 
the number of steps) for the first week and an unsealed pedometer for a second week. 
Comparing both conditions revealed a main effect, with significantly lower daily step 
counts reported in the sealed condition than in the unsealed condition. In a follow-up 
study, Clemes and Parker (2009) tried to delineate different factors that may contribute 
to this reactivity. Participants had to wear the pedometer in three different ways, each 
for 1 week: sealed, unsealed, and unsealed plus logging daily steps in an activity diary. A 
significant overall effect of condition emerged, with post hoc analyses showing the great- 
est increase in step counts in the diary condition. This suggests that reactivity to pedom- 
eters is greatest when participants are requested to wear an unsealed pedometer and 
record their step counts. This highlights the necessity to include reactivity considerations 
in the study design to maximize data quality. In a study by Bussmann and colleagues 
(2010) patients with spinal cord injury had to wear a multichannel system, and interviews 
showed that the device did not affect the amount of wheelchair driving, although patients 
reported some burden. 

Another common concern is that older adult participants are unable or unwilling to 
use high-tech devices. However, Smeja and colleagues (1999) examined patients with 
Parkinson’s disease, the oldest being 82, and obtained good results. Also, in a study by 
Karunanithi (2007), healthy subjects ranging from 80 to 86 years in age wore accelerative 
devices continuously for 2-3 months. All subjects found the devices easy to use, unobtrusive, 
and comfortable to wear, and the compliance rate (number of days worn) was good (88%). 


Reliability and Validity from a Technology Point of View 


Sensors that are validated from a technological point of view do not automatically pro- 
vide valid outcome parameters. Rather, technological validation and/or reliability testing 
is a critical first step in the validation process. The technological reliability of ambulatory 
activity monitoring devices is usually high. For example, Tryon and Williams (1996) and 
Tryon (2005) used a pendulum in the laboratory setting to test the reliability of multiple 
actigraphs, and the resulting reliability was found to be between 97.5 and 99.4%. How- 
ever, older or cheaper devices may produce less favorable reliabilities. Actometers are 
subject to other specific validity threats such as over- or underestimation during specific 
activities and the biasing effect of external vibration. Regardless of the algorithmic approach 
used for detecting posture and motion patterns, misclassifications of postures 
and movement patterns may occur, especially when patterns of movement or postures are 
similar, such as in walking and climbing stairs (Foerster, 2001), and in cases where sub- 
jects engage in postures or movements quite different from the set of predefined postures 
or motions. Identification of situations such as walking up stairs or using a lift can be 
improved by precisely measuring barometric pressure (Ebner-Priemer, 2004). 


Setting and Background of Behavior 


Real-life assessment is a key advantage of ambulatory activity assessment that makes 
experimental induction unnecessary and improves both validity and generalizability. But 
there are some concerns for validity in ambulatory activity monitoring. First, in most 
of the studies, researchers do not know about the actual setting, nor do they have any 
influence on the physical and social environment. For example, an irregular walking 
pattern can result from poor balance while walking on a flat surface, but it may also be 
due to an appropriate reaction to an uneven sidewalk. Additionally, the idiosyncratic 
reasons or motivations for engaging in or refraining from activity are largely unknown. 
For example, there may be multiple reasons for low physical activity measured objectively 
by accelerometry: general laziness or an upcoming examination that ties the participant 
to his books. Combining actigraphs with electronic diaries that assess context variables 
might be helpful. Additionally, stored variables (light, noise, temperature) may be useful 
for recognizing situations and present additional contextual information. 


Outcome Measures 


Ambulatory monitoring of posture and movement is a method of measurement. It allows 
long-term and objective assessment of activity, postures, and motions, and is likely to 
provide reliable and valid parameters. However, an extrapolation from a reliable and 
valid technique (ambulatory monitoring of posture and movement) to a reliable and valid 
outcome measure is certainly questionable. Measuring global activity, for example, might 
not be the best and most valid parameter when psychomotor agitation in depression is 
of interest. Similarly, Mathiassen (2006) suggested that the negative effects of repetitive 
operations (behavior) on musculoskeletal disorders are not reflected in an overall measure 
of activity level, but in the distribution (variation) of specific activities over the day. One 
of the future challenges of ambulatory monitoring of posture and movement is to opera- 
tionalize parameters that really and uniquely express the aspect of movement behavior 
that is of interest. 


Selected Examples of Ambulatory Behavior Monitoring 
in Psychology and Related Fields 


The reasons for applying behavior monitoring can be idiosyncratic. We distinguish four 
major aims. The first aim is diagnostics. A physical or mental disease may affect move- 
ment behavior, and information on that may help the clinician or therapist to diagnose 
the disorder and/or its consequences, and to decide on the most appropriate treatment. 
A second aim is feedback. Behavior monitoring can be used as a means of providing 
feedback, for example, to stimulate people to become more active or to teach people to 
distribute their activities more efficiently. Third, effect evaluation can be a goal. Many 
interventions exist that have movement behavior as one of the primary or secondary 
outcomes. Finally, behavior monitoring can be used to solve the problem of methodological 
confounds. Many physiological parameters, such as blood pressure, ECG, and breathing 
rate, depend on the posture adopted and the movement performed. In studies focusing on 
these physiological and mental parameters, insight into the simultaneously measured 
movement behavior may facilitate their interpretation. To show some applications 
of ambulatory activity monitoring in psychology and related fields, and at the same time 
show their different backgrounds, we provide some selected examples. 


Posture and Motion in Psychiatric Disorders 


Altered physical activity is a common criterion for psychiatric disorders. Teicher (1995) 
reported 30 different psychiatric disorders with increased or decreased activity, whereas 
Tryon (2006) delineated 48. Well-known examples are attention deficit/hyperactivity 
disorder (ADHD), schizophrenia (disorganized or catatonic behavior), major depression 
(psychomotor retardation or agitation), and bipolar disorders (psychomotor agitation 
during manic episodes). We believe that if a variable is sufficiently important to consti- 
tute an inclusion criterion for a diagnostic category, then it is, by definition, an important 
outcome variable that should be monitored to evaluate the effectiveness of therapeutic 
interventions. Unfortunately, most studies on activity in psychiatric disorders rely on self- 
report measures, even though discrepancies between self-report of activities and objective 
measurement of physical activities are well established. 

In major depressive disorders, in which psychomotor retardation is a defining crite- 
rion, Volkers and colleagues (2003) reported a lower level of motor activity and longer 
periods of inactivity during the waking hours, as well as a higher level of motor activity 
and decreased immobility during sleep. Interestingly, treatment studies examining effects 
of various antidepressants found asynchronous effects, as clinical ratings of retardation 
were not closely correlated with changes in activity pattern over the course of treatment 
(Volkers, Tulen, van Den Broek, et al., 2002). Similar asynchronous treatment effects 
have been found in schizophrenia (Apiquian et al., 2008), highlighting the necessity to go 
beyond self-reports in assessing PA. 

Ambulatory activity monitoring has proven useful especially with respect to 
situation-specific hyperactivity in children with ADHD (Porrino et al., 1983; Teicher, 
1995). Heightened activity was found during structured school tasks but not during free- 
time events (Porrino et al., 1983). In a pilot study, Teicher (1995) examined the distri- 
bution of activities and found that children with ADHD had few periods of low-level 
activity during the day. They concluded that hyperactivity in ADHD seems to be more 
characterized by the relative absence of quiet periods than by the presence of extreme 
activity. Such a pattern of behavior is less likely to be revealed by self-reports. Multiple studies 
have investigated the effects of medication on hyperactivity and sleep in patients with 
ADHD using actigraphy (Boonstra et al., 2007; Corkum, Panton, Ironside, Macpherson, 
& Williams, 2008). Recently, Tryon, Tryon, Kazlausky, Gruen, and Swanson (2006) 
reported the use of an interactive ambulatory feedback system. To decrease motor excess 
in children with ADHD, they programmed actigraphs to reinforce low PA during school 
periods. In a pre- and posttreatment research design, Tryon et al. demonstrated signifi- 
cant activity level reductions using this interactive ambulatory feedback system. 


Mood and Personality Research 


That physical activity is related to mood has often been reported, but such studies have 
mostly relied on questionnaires that assessed the relation between mood and activity ret- 
rospectively. In a relatively recent study, Schwerdtfeger, Eberhardt, and Chmitorz (2008) 
examined this association prospectively using ambulatory assessment: accelerometers for 
PA and electronic diaries for mood. Mixed-model analyses confirmed earlier studies: 
Energetic arousal/positive affect was significantly associated with preceding PA episodes, 
suggesting that daily PA episodes modulate mood. Associations between mood and activ- 
ity pattern are also of interest in disorders with movement pathologies, such as Parkin- 
son’s disease. Psychological research has helped researchers to understand how tremor 
activity is affected by change of posture, time of day and night, and especially emotional 
events. Thielgen, Foerster, Fuchs, Hornig, and Fahrenberg (2004) investigated distinct 
psychophysiological episodes in which the tremor was noticeably enhanced by emotional 
activation or mental effort in patients with Parkinson’s disease. Physical activity is also 
related to personality dimensions. Volkers, Tulen, Duivenvoorden, and colleagues (2002) 
studied the effects of personality dimensions on 24-hour activity patterns in 101 healthy 
subjects. Random regression models showed that subjects scoring high on harm avoid- 
ance had lower activity levels than subjects scoring low on harm avoidance, and subjects 
scoring high on reward dependence had higher overall levels of motor activity than sub- 
jects who scored low on reward dependence. 


Physical Activity and Posture as Confounding Variables 
in Psychophysiology 


As described in the previous paragraph, PA and posture can be important confounding 
variables in general, but especially in psychophysiology (see also F. Wilhelm, Grossman, 
& Müller, Chapter 12, this volume). Real-time analysis of physical activity and physi- 
ological signals allows partialing out of emotional and physical influences concurrently 
at the time of assessment. Myrtek (2004), for example, monitored heart rate and physi- 
cal activity in daily life, and separated out—in real time—heart rate increases caused by 
PA. The remaining additional heart rate was assumed to indicate momentary emotional 
activation or mental load. The recorder/analyzer was programmed to trigger a handheld 
computer, which in turn signaled the participant to provide a self-report on momentary 
activity, situation, and emotion. This occurred when the additional heart rate exceeded 
a certain threshold. Control periods were obtained by randomly interspersed trigger sig- 
nals. The algorithm was used and validated in a series of studies with different popula- 
tions and around 1,300 participants (Myrtek, 2004). 
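
The logic of separating activity-related from non-activity-related heart rate increases can be sketched as follows. This is a simplified illustration, not Myrtek's validated algorithm: it assumes minute-by-minute values, a baseline formed from the preceding three minutes, and arbitrary thresholds (`hr_threshold`, `activity_tolerance`) chosen only for the example.

```python
from statistics import mean


def additional_heart_rate_minutes(heart_rate, activity, baseline_minutes=3,
                                  hr_threshold=3.0, activity_tolerance=10.0):
    """Return indices of minutes whose heart rate rise is not matched by a rise in activity.

    heart_rate: beats per minute, one value per minute
    activity:   movement counts from an accelerometer, one value per minute
    """
    if len(heart_rate) != len(activity):
        raise ValueError("heart_rate and activity must have the same length")
    flagged = []
    for i in range(baseline_minutes, len(heart_rate)):
        # Compare the current minute with the mean of the preceding baseline minutes.
        hr_rise = heart_rate[i] - mean(heart_rate[i - baseline_minutes:i])
        activity_rise = activity[i] - mean(activity[i - baseline_minutes:i])
        # A heart rate rise without a matching activity rise is treated as "additional".
        if hr_rise >= hr_threshold and activity_rise <= activity_tolerance:
            flagged.append(i)
    return flagged
```

In a setup like the one described, each flagged minute would be a candidate moment for triggering the handheld computer to request a self-report, with randomly timed prompts serving as control periods.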


Posture and Motion in Overweight and Obesity 


In health psychology there is strong interest in promoting and assessing PA because of 
the worldwide increase in obesity. Accelerometry sensors are increasingly used to reveal 
activity patterns. For example, Cooper, Page, Fox, and Misson (2000) demonstrated that 
obese participants were less active than nonobese adults, particularly when participants 
were unconstrained by the weekday routine and free to choose their level of activity, such 
as on weekends. Studies have used the amount of physical activity performed as feedback 
for promoting increases in physical activity. For example, Butcher, Fairclough, Stratton, 
and Richardson (2007) showed that participants in the feedback group, who were free 
to view their step counts, made significantly more steps than those in the control group, 
who received no step count feedback. For reviews on the use of accelerometry for esti- 
mating energy expenditure, see Corder and colleagues (2007), Chen and Bassett (2005), 
Trost and colleagues (2005), and Ward and colleagues (2005). For links to commercially 
available portable biofeedback devices that continuously monitor the PA of their users 
and provide visual feedback on the time spent in different activities, see the 
www.ambulatory-assessment.org website (Links/Hardware). 


Summary and Perspectives 


Even though ambulatory activity monitoring has proven to be invaluable and widely 
applicable in various fields of research, it is still underutilized. However, researchers are 
becoming increasingly aware of its many advantages and potential fields of application 
(Bussmann & Stam, 1998; Bussmann et al., 2001; Fahrenberg, Mueller, Foerster, & 
Smeja, 1996; Fahrenberg & Myrtek, 2001; Tryon, 2006). In searching medical data- 
bases, the increasing number of hits shows that ambulatory assessment of PA is steadily 
becoming more popular in medicine (Troiano, 2005). There are multiple reasons for this 
increase. First, the U.S. government identified PA as the number one health indicator in 
the road map initiative Healthy People 2010 (U.S. Department of Health and Human 
Services, 2000). Second, interest in ambulatory monitoring (Fahrenberg et al., 2007) is 
generally increasing. Third, due to rapid progress in the development of sensors, data 
recording technologies, and advanced computer algorithms, ambulatory assessment of 
PA and movement has experienced a critical breakthrough as a mainstream methodol- 
ogy. Advanced feedback algorithms are already available: online detection of motion pat- 
terns and transmission via mobile phones to intelligent systems that support online feed- 
back to promote modification of activity patterns. Examples are continuous feedback of 
energy expenditure to promote weight loss among obese people, and providing feedback 
on hyperactivity in patients with ADHD as biofeedback in everyday life. Although most 
psychologists still seem reluctant to use ambulatory activity monitoring, laypeople are 
already using pedometers extensively for monitoring their activity in daily life. Pedom- 
eters are readily available in discount stores, and telecommunication companies offer 
software upgrades that enable mobile phones to measure daily steps. We are convinced 
that further technological progress with smaller devices and improved battery power will 
facilitate long-term monitoring. This might help with making the methodology attractive 
not only for the lay population but also for psychological scientists who seek to investi- 
gate and explain behavior in its natural context. 


CHAPTER 14 


Passive Telemetric Monitoring 
Novel Methods for Real-World Behavioral Assessment 


MATTHEW S. GOODWIN 


This chapter describes an emerging class of information systems technology called 
passive telemetrics, which includes “wearable” and “ubiquitous” computers that record 
behavioral, physiological, and environmental data from sensors worn on the body or 
embedded in the environment. Passive telemetrics can enable social and behavioral sci- 
entists to gather remotely the types of data needed for intensive longitudinal studies. 
The chapter is organized in the following manner. The first section provides a general 
overview of passive telemetrics, including advances in wearable and ubiquitous comput- 
ing technology. The second section discusses advantages of using passive telemetrics for 
behavioral assessment and illustrates with applied examples how this technology is being 
used by social and behavioral scientists to measure and intervene on a wide variety of 
behaviors in natural settings. The final section discusses ethical, practical, statistical, and 
measurement issues that must be considered when using passive telemetrics, and high- 
lights future research needed in these areas. 


Passive Telemetrics 


Passive telemetrics utilizes unobtrusive technology to measure and transmit behavioral, 
physiological, and environmental data wirelessly from remote locations automatically, 
using sensors worn on the body and/or embedded in the environment. In the past 10 
years, researchers in academia (Massachusetts Institute of Technology, Georgia Tech, 
Carnegie Mellon, and Stanford, among others), government (NASA, among others), and 
industry (Intel and Hitachi, among others) in the United States, Europe, Australia, and 
Asia have developed unobtrusive wireless technologies that can be woven into clothes 
and mounted discreetly in the environment. These two initiatives have shaped emerging 
computer science and engineering fields known as wearable computing (or wearables) 
and ubiquitous computing (or ubicomp). 


Wearable Computing 


Breaking away from the traditional desktop computer paradigm, computer scientists 
are actively developing computational systems that are as mobile as their users (Starner, 
2001). The extreme process of miniaturization in both electronic circuits and processors 
is enabling passive (i.e., no conscious input from the wearer required), on-body percep- 
tion systems to be sewn into articles of clothing and embedded into wearable accessories 
such as shoes, gloves, and jewelry (e.g., Fletcher et al., 2010; Healey, 2000). As can be 
seen in Figure 14.1, small, wireless physiological sensors have been developed to record 
cardiovascular, respiratory, skin conductivity, and muscle activity unobtrusively in freely 
moving people. Miniature actigraphs and accelerometers capable of objectively quantify- 
ing physical activities, including body postures (e.g., lying, sitting, standing) and dynamic 
activities (e.g., walking, climbing stairs, postural transitions), are being embedded in 
wristbands, bracelets, and belts (e.g., Bao & Intille, 2004). Audio capture technologies 
that utilize small, unobtrusive microphones are being integrated into wearable computing 
platforms to record sounds created by the user (e.g., speech, gestures) and ambient audi- 
tory events in the environment (e.g., Mehl, Pennebaker, Crow, Dabbs, & Price, 2001). 
And discreet, infrared-sensitive cameras are being integrated into eyeglasses to determine 
where and what a user is looking at (e.g., Dickie et al., 2004). 


Ubiquitous Computing 


In 1991, Mark Weiser—head of the Computing Science Laboratory at the Xerox Palo 
Alto Research Center—introduced the area of ubiquitous computing when he suggested 
development of technologies that “weave themselves into the fabric of everyday life until 
they are indistinguishable from it” (p. 94). Since Weiser’s decree, researchers have 
embedded ubiquitous sensing technologies in the environment to develop “living laboratories” 
and “smart rooms” capable of wirelessly and passively monitoring surroundings and 
inhabitant behavior (e.g., Chan, Campo, Estéve, & Fourniols, 2009). As seen in Figure 
14.2, dense arrays of unobservable telemetric sensing devices are typically embedded 
in these settings to record interior conditions such as temperature, humidity, light, and 
barometric pressure. Discreetly mounted microphones, video cameras, and motion sensors 
are used to record what people say, activities in which they are involved, and where 
they go. And even less obtrusive sensors are being explored, including force-sensitive 
load tiles to identify and locate a person based solely on his or her footsteps (Orr & 
Abowd, 2000), radio frequency identification (RFID) tags attached to common items to 
detect person-object interactions (Munguia Tapia, Intille, & Larson, 2004), and cameras 
embedded in mirrors that enable noncontact recording of cardiovascular activity from 
the face (Poh, McDuff, & Picard, 2010). All of these sensing devices can automatically 
and continuously relay their recordings to a network of computers to produce a single, 
synchronized dataset for researchers to explore. 


FIGURE 14.1. MIThril system, composed of the Zaurus PDA (right) with Hoarder sensor hub 
and physiological sensing board (top), electrocardiography (ECG)/electromyograph (EMG)/galvanic 
skin response (GSR)/temperature electrodes and sensors (left), and combined three-axis 
accelerometer and infrared tag reader (bottom). From Sung and Pentland (2004). Reproduced 
with permission from Alex Pentland. 


FIGURE 14.2. The PlaceLab living room and kitchen, office, and master bath. All of the observational 
sensing is built directly into the cabinetry. Although the sensors are ubiquitous, they become 
part of the design aesthetic (small black windows). The inset in the left image shows a microphone. 
From Intille et al. (2005). Reprinted with permission from the Association for Computing Machinery, 
Inc. 


Passive Telemetric Monitoring 
in the Social and Behavioral Sciences 


Utilizing wearable and ubiquitous sensors, social and behavioral scientists can now con- 
figure sophisticated methodological tools to telemetrically monitor an individual’s behav- 
ioral, affective, and physiological responses unobtrusively and repeatedly in a wide range 
of real-world settings. For instance, Nusser, Intille, and Maitra (2006) claim: 


With the technology available today, a comfortable device can be created that collects con- 
tinuous streams of video data describing everything the subject sees, audio data of everything 
the subject hears and says, accelerometer data of the subject’s limb motion, data on physi- 
ological parameters such as heart rate, data on the subject’s location in the community, as 
well as other miscellaneous data about how the subject is feeling, as reported by the subject 
via a mobile computing device user interface. (p. 254) 


All this information can be simultaneously recorded, time stamped, and synchronized, so 
that contextual and temporal relationships can be ascertained and verified. 

Passive telemetric monitoring supports at least four key advances in behavioral assess- 
ments. First, passive telemetric sampling devices can store an unprecedented amount of 
longitudinal data appropriate for idiographic analyses. Second, discreet, wireless sensors 
worn on the body and embedded in the environment can decrease measurement reactivity 
and increase ecological validity by recording information about a freely moving person’s 
location, intensive physiological states, and activities comfortably, continuously, and pas- 
sively. Third, when self-reports are desired, researchers can configure a “context-aware” 
mobile computing device that integrates passive sensors (e.g., heart rate, global position- 
ing system [GPS], accelerometry) with handheld devices to issue a prompt for more infor- 
mation only when a person is in a specific location, engaged in a particular activity, or 
experiencing a certain behavioral or physiological response (Nusser et al., 2006). Finally, 
pattern recognition algorithms (Theodoridis & Koutroumbas, 2006) can run on context- 
aware mobile computing devices to tailor questions, trigger a response, or provide feed- 
back based on a person’s physical and physiological patterns, reported subjective states, 
or interpersonal experiences. 
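
A context-aware prompting rule of the kind described above might be sketched as follows. The data classes, field names, and thresholds are hypothetical and chosen only for illustration. The sketch assumes that an upstream classifier has already turned accelerometer data into an activity label, and it uses the haversine formula to test whether a GPS fix lies inside a circular area of interest.

```python
import math
from dataclasses import dataclass
from typing import Optional

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


@dataclass
class SensorSnapshot:
    latitude: float
    longitude: float
    heart_rate_bpm: float
    activity: str  # label produced upstream, e.g. "sitting" or "walking"


@dataclass
class PromptRule:
    latitude: float
    longitude: float
    radius_m: float
    activity: Optional[str] = None              # None means "any activity"
    min_heart_rate_bpm: Optional[float] = None  # None means "any heart rate"


def should_prompt(snapshot: SensorSnapshot, rule: PromptRule) -> bool:
    """Prompt only when every condition configured in the rule is satisfied."""
    inside = haversine_m(snapshot.latitude, snapshot.longitude,
                         rule.latitude, rule.longitude) <= rule.radius_m
    activity_ok = rule.activity is None or snapshot.activity == rule.activity
    heart_rate_ok = (rule.min_heart_rate_bpm is None
                     or snapshot.heart_rate_bpm >= rule.min_heart_rate_bpm)
    return inside and activity_ok and heart_rate_ok
```

Requiring all configured conditions at once is a design choice made for this example. A study that wants to prompt on any single condition would combine the three checks with `or` instead.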

A few exemplary applications of passive telemetric monitoring are presented below 
to illustrate how this class of technology can enhance social and behavioral scientific 
efforts to measure and intervene on a wide variety of behaviors in natural settings. 


Adherence and Compliance Monitoring 


The full benefit of any effective treatment can be achieved only if prescribed treatment 
regimens are followed. Unfortunately, average adherence rates as low as 43% are com- 
monly found in patients receiving treatment for chronic conditions (Claxton, Cramer, & 
Pierce, 2001). Poor adherence to prescribed treatment contributes to substantial worsen- 
ing of morbidity and mortality, and to increased health care costs in the United States. For 
instance, 33 to 69% of all medication-related hospital admissions in the United States are 
a result of poor medication adherence, costing society approximately $100 billion a year 
(McDonnell & Jacobs, 2002). Medication adherence has traditionally been estimated by 
counting returned, unused medications (pill counts); by interviewing patients; or by hav- 
ing patients complete diaries and questionnaires. All of these methods are strongly biased 
toward overestimation of adherence due to social pressure for the patient to appear com- 
pliant to the physician (Elixhauser, Eisen, Romeis, & Homan, 1990). While directly 
observing adherence is a more accurate sampling strategy, this method is often expensive 
and burdensome for the health care provider (Osterberg & Blaschke, 2005). 

Passive telemetric devices may be useful in improving adherence. For example, Fish- 
kin and Wang (2003) developed a portable monitoring pad that combines RFID tags with 
a sensitive scale to detect when people pick up and put down pill bottles, and to measure 
how their weight has changed. Coupled with a wireless, network-enabled tablet display, 
the system can proactively remind the user, and people in his or her social or caregiver 
network, which pills to take, how many, and when. 
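
The weight-based part of such a monitoring pad can be illustrated with a small calculation: when a tagged bottle is put back down, the drop in weight divided by the weight of one pill gives the number of pills removed. The sketch below is not the cited system. The function names, the 0.05 g tolerance, and the status labels are assumptions made for the example.

```python
def pills_removed(weight_before_g, weight_after_g, pill_weight_g, tolerance_g=0.05):
    """Estimate how many pills left the bottle between pick-up and put-down.

    Returns None when the weight change is not close to a whole number of pills,
    for example because of scale noise or because something else was removed.
    """
    if pill_weight_g <= 0:
        raise ValueError("pill_weight_g must be positive")
    delta_g = weight_before_g - weight_after_g
    count = round(delta_g / pill_weight_g)
    if count < 0 or abs(delta_g - count * pill_weight_g) > tolerance_g:
        return None
    return count


def dose_status(count, prescribed):
    """Compare an estimated pill count with the prescribed dose (hypothetical labels)."""
    if count is None:
        return "unclear"
    if count == prescribed:
        return "as prescribed"
    return "too few" if count < prescribed else "too many"
```

A reminder system could use the returned status to decide whether to notify the user or a caregiver, as the text describes.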


Medical Monitoring 


According to data from the United Nations Population Division (2001), approximately 
606 million people in the world were over 60 in the year 2000; by 2050, this number is 
expected to reach 2 billion people. Current health care systems dominated by expensive 
and infrequent patient visits to physician offices cannot adequately service the growing 
needs of the aging population. Combining the finding that quality of life for people in 
their homes is generally better than that for people in institutions (Rivlin & Wiener, 
1988) with the high cost for institutional care relative to home care, medical providers are 
turning to passive telemetric technologies to provide better monitoring and feedback for 
detecting and correcting health problems when they first appear in the home environment 
(e.g., Mynatt, Essa, & Rogers, 2000). 

Combining multivariate sensor streams with pattern-recognition and machine- 
learning techniques, these systems can continuously record a user’s vital signs, motor 
activity, social interactions, sleep patterns, and other health indicators to monitor well- 
being, trigger reminders, and broadcast relevant information to caregivers. For instance, 
the LiveNet system created at the Massachusetts Institute of Technology (MIT) Media 
Lab combines a customizable personal digital assistant (PDA)-based wearable platform, a 
software and network resource discovery application program, and a real-time machine- 
learning inference infrastructure to gather a variety of sensor data, perform real-time 
processing and mining on this data, return classification results and statistics, trigger 
alarms, and enable peer-to-peer wireless networking. Sung, Marci, and Pentland (2005) 
highlight the benefits of using passive telemetric technology for medical monitoring and 
review many promising applications of these devices. First, such powerful diagnostic sens- 
ing technologies can enable doctors to obtain more context-specific information directly 
instead of relying on a patient’s recollection of past events and symptoms, which tend to 
be vague, incomplete, and error prone. Second, unlike infrequent surveys and interviews, 
continuous monitoring can enable very fine-grained quantitative data to be obtained. 
Third, wireless monitoring of behavior in the natural environment can ensure capture 
of relevant events and concomitant physiology wherever the patient is, expanding the 
view of health care beyond the traditional inpatient setting. A few of the potential medi- 
cal applications they highlight include (1) monitoring movement states of patients with 
Parkinson’s disease to establish a timeline of symptom severity and motor complications; 
(2) characterizing and identifying epileptic seizures through accelerometry to develop an 
ambulatory monitor with a real-time seizure classifier; (3) monitoring physiological and 
behavioral reactions to tailor medications to an individual; and (4) developing physi- 
ological and behavioral measures to classify emotional states associated with preclinical 
symptoms of psychosis, mood, anxiety, and personality disorders. 


Autism Spectrum Disorders 


Autism spectrum disorders (ASDs) are a collection of neurodevelopmental disorders char- 
acterized by qualitative impairments in socialization, communication, and circumscribed 
interests, including stereotypical behavior patterns and behavioral rigidity to changes in 
routines (American Psychiatric Association, 1994). Current epidemiological studies of 
ASD suggest a rate as high as 1 in 110 in children age 8 years in the United States (Cen- 
ters for Disease Control and Prevention, 2009). ASDs typically manifest in infancy and 
persist throughout the lifespan. These disorders have a profound impact on families, and 
often result in enormous emotional and financial costs. For instance, recent estimates 
suggest that the societal costs in the United States to care for all individuals diagnosed 
each year over their lifetimes approach $35 billion (Ganz, 2007). 

Through our Autism and Communication Technology Initiative at the MIT Media 
Lab, we are developing a variety of passive telemetric technologies to understand better 
and support individuals with ASD in natural environments. Two of these applications, 
described below, include automatically detecting stereotypical motor movements using 
wireless accelerometers and pattern recognition algorithms, and quantifying early behav- 
ioral manifestations of ASD in the home using ultradense video and audio data-capture 
technologies. While these projects are works in progress, they demonstrate emerging 
capabilities of passive telemetrics and illustrate several ways that this technology can 
improve the study of complex behaviors in real-world settings. 


SENSOR-ENABLED DETECTION OF STEREOTYPICAL MOTOR MOVEMENTS 


Stereotypical motor movements (SMMs) are generally defined as repetitive motor 
sequences that appear to an observer to be invariant in form and without any obvious 
eliciting stimulus or adaptive function. Several SMMs have been identified, the most 
prevalent among them being body rocking, mouthing, and complex hand and finger 
movements (e.g., Goldman et al., 2009). SMMs occur frequently in people with mental 
retardation, developmental disabilities, and genetic syndromes (Bodfish, Symons, Parker, 
& Lewis, 2000), and less frequently in typically developing children and adults. 

When severe, SMMs can present several difficulties for individuals with ASD and 
their families. First, persons with ASD often engage in SMMs. Preventing or stopping 
these movements can be problematic, because individuals with ASD may become anx- 
ious, agitated, or aggressive if they are interrupted (Gordon, 2000). Second, if unregu- 
lated, SMMs can become the dominant behavior in the repertoire of an individual with 
ASD and interfere with the acquisition of new skills and performance of established skills 
(e.g., Koegel & Covert, 1972). Third, engagement in these movements is socially inappro- 
priate and stigmatizing, and can complicate social integration in school and community 
settings (Jones, Wint, & Ellis, 1990). Finally, SMMs are thought to lead to self-injurious 
behavior under certain environmental conditions (Kennedy, 2002). 

Traditional measures of SMMs rely primarily on paper-and-pencil rating scales, 
direct behavioral observation, video-based methods, and kinematic analyses, all of which 
have limitations (Sprague & Newell, 1996). Paper-and-pencil rating scales are subjective, 
can have questionable accuracy, and fail to capture intraindividual variations in the form, 
amount, and duration of SMMs. The following factors can make direct observational 
measures unreliable: (1) reduced accuracy in observing and documenting high-speed 
motor sequences; (2) difficulty determining when a sequence has started and ended; (3) 
observer errors attributable to memory and the ability to estimate the amount of SMM 
activity over a finite period of time; (4) limitations in the ability to observe concomitantly 
occurring SMMs; and (5) limitations in the ability to note environmental antecedents and 
record SMM activity at the same time. Video-based methods and kinematic analyses are 
often highly accurate and reliable; however, they are tedious and time-consuming, which 
makes them impractical for most clinicians to use in applied settings on a regular basis. 
The necessity to code videos offline also precludes real-time monitoring. 


FIGURE 14.3. The MITes three-axis wireless accelerometer sensors housed in plastic cases with 
external battery holder. 


In response to these problems, we are developing a passive telemetric system that 
automatically recognizes and monitors SMM activity that is more objective, detailed, 
and precise than rating scales and direct behavioral observation, and more time-efficient 
and mobile than video-based methods and kinematic analyses. As seen in Figure 14.3, 
our system uses miniature, low-cost, three-axis wireless accelerometers (Munguia Tapia, 
Marmasse, Intille, & Larson, 2004) that are comfortably worn on an individual’s wrists 
and torso and wirelessly transmit motion data to a mobile phone. Pattern recognition 
algorithms running on the phone receive these acceleration streams, as seen in Figure 
14.4, compute a variety of time and frequency domain features, and automatically detect 
SMM topography, onset, offset, frequency, duration, and intensity. In pilot work, the 
system correctly identified (i.e., compared to “gold standard,” offline, frame-by-frame 
video-coded records) stereotypical body rocking, hand flapping, and head hitting approx- 
imately 90% of the time across six individuals with ASD in both laboratory and class- 
room settings (Albinali, Goodwin, & Intille, 2009). 
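
The "time and frequency domain features" mentioned above can be illustrated with a minimal Python sketch. It is not the algorithm used in the pilot work: the window length, sampling rate, feature set, and function names are all hypothetical, and a real system would pass such features to a trained classifier rather than inspect them directly.

```python
import cmath
import math
from statistics import mean, pvariance


def window_features(samples, rate_hz):
    """Return simple features for one axis of one window of acceleration data.

    samples: list of floats for a single axis (hypothetical input format).
    rate_hz: sampling rate in Hz (an assumed value, not taken from the chapter).
    """
    n = len(samples)
    mu = mean(samples)
    centered = [s - mu for s in samples]
    # Naive discrete Fourier transform magnitudes for bins 1 .. n // 2.
    mags = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        mags.append(abs(coeff))
    peak_bin = 1 + mags.index(max(mags))
    return {
        "mean": mu,                           # time domain
        "variance": pvariance(samples),       # time domain
        "dominant_hz": peak_bin * rate_hz / n,  # frequency domain
    }


def sliding_windows(samples, size, step):
    """Yield successive, possibly overlapping, windows of a fixed size."""
    for start in range(0, len(samples) - size + 1, step):
        yield samples[start:start + size]


# Synthetic example: a 2 Hz rhythmic movement sampled at 32 Hz for 2 seconds.
RATE = 32
signal = [math.sin(2 * math.pi * 2 * t / RATE) for t in range(2 * RATE)]
features = [window_features(w, RATE) for w in sliding_windows(signal, 32, 16)]
```

On this synthetic signal each one-second window reports a dominant frequency of 2 Hz. Real acceleration data would be noisier and would normally be summarized across all three axes and all sensor locations.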

There are several potential benefits associated with this passive telemetric system. 
The tool may provide a convenient, low-cost alternative or supplement to potentially lim- 
iting rating scales, direct observation, and video encoding methods. Easily automating 
SMM detection could free a human observer to concentrate on and note environmental 
antecedents and consequences necessary to determine functional relations that exist for 
this perplexing and often disruptive class of behavior. The system could also be used 
as an outcome measure to facilitate efficacy studies of behavioral and pharmacological 
interventions intended to decrease the incidence or severity of SMMs. Finally, with minor 
modifications, the system could be programmed to serve as an intervention tool by 
providing real-time feedback to individuals with ASD and/or their caregivers when SMMs 
are detected. 

FIGURE 14.4. Acceleration plot showing hand flapping and body rocking stereotypical motor 
movements for one participant with ASD. Top: Left wrist. Middle: Right wrist. Bottom: Torso. 


ULTRADENSE VIDEO AND AUDIO DATA CAPTURE IN THE HOME 


Some of the most important decisions to be made in the future concerning ASD relate to 
how early it can be detected and to what end. Criteria for diagnosing ASD rely on 
behavioral features, since no reliable biological markers have yet been established. Currently, 
ASD is not diagnosed until a child is 3 to 5 years of age, despite a growing body of litera- 
ture suggesting that behavioral abnormalities are evident in the first 2 years of life. 

Case studies, parents’ retrospective reports, and analyses of home videos of children 
later diagnosed with ASD are currently the primary methods used to identify behavioral 
features of ASD before age 24 months. However, this body of research is potentially lim- 
ited by methodological problems associated with recall bias, undersampling, and over- 
sampling (for a review, see Zwaigenbaum et al., 2007). To overcome these limitations, we 
are developing passive telemetric audio and video technologies for densely sampling and 
semiautomating the coding of behavioral manifestations of ASD in the first 2 years of life. 

Using a system we call the Speechome Recorder, seen in Figure 14.5a, we can record 
the auditory and visual environments of children in their homes. High-quality camera 
(fish-eye lens) and microphone (boundary layer) hardware, also seen in Figure 14.5a, is 
built into the “head” of the recorder. In order to capture detailed facial and gestural 
observations, a second, horizontally oriented camera is incorporated into the base of the 
unit. The recorder contains sufficient computational power and disk storage to record, 
compress, and store approximately 2 months of continuous data before it needs to be 
transferred by removing a portable disk from the base of the device. A touch display is 
used to turn the recorder on and off. Physically, the device resembles an arching floor 
lamp with a friction-fit ceiling brace for stability. The mast folds down and may be 
removed from the base for easy transport. The height and slant of the mast are adjustable 
to a wide range of domestic settings, and a unit can be installed or removed in minutes. 
Privacy measures built into the Speechome Recorder include a one-button controller to 
start and stop recording, a second button to delete recorded data retroactively, and the 
capacity for anyone appearing in a specific segment of recording to enable permanent 
deletion of that segment. The recorder also enables semiautomated motion analysis, 
speech detection, and speaker identification (DeCamp, 2007; DeCamp & Roy, 2009; 
Roy, 2007; Roy et al., 2006). A data visualization example is provided in Figure 14.5c. 

FIGURE 14.5. (a) Prototype of the portable Speechome Recorder (photograph by Rony Kubat); 
(b) sample video image from the living room of the Speechome project pilot study (photo by Deb 
Roy); (c) 30-minute trace of movement of child and father in the pilot study generated by 
machine-assisted video-tracking technology with bright “social hot spots” at points of sustained 
interaction (visualization by George Shaw). 

Developing passive telemetric audio and video technologies for densely sampling 
and semiautomating behavioral coding of young children with ASD could spawn numer- 
ous research avenues. For instance, some of the passive telemetric technology previously 
reviewed in this chapter could be integrated with the Speechome Recorder, including 
various autonomic (e.g., cardiovascular), physical (e.g., accelerometers), and environmen- 
tal (e.g., RFID) sensors that enable additional data collection pertinent to monitoring 
early autistic behavior, such as arousability, SMM, and restricted object usage (e.g., ste- 
reotyped play). Passive telemetrics that can be deployed in real-world settings and record 
intensively over time could also enhance current autism research and treatment in a num- 
ber of important ways. Since clinicians primarily observe young children with ASD in 
laboratory or clinic settings over relatively brief periods of time, enhanced naturalistic 
observation methods could enable assessments in more situations and diversified settings 
that capture a greater range of contexts (e.g., child care settings, pediatricians’ offices). 
Ultradense longitudinal monitoring in home environments could also be used in prospec- 
tive studies that employ sibling designs (e.g., Bryson et al., 2007) and enable closer exami- 
nation of infants during critical periods of development. The level of detailed data made 
available by these sampling devices could also help identify salient behavioral, psycho- 
logical, and speech-based phenomena to guide neurophysiological and genetic assessment 
targets. Putting resources into technologies that are more broadly accessible than the 
paper-and-pencil methods currently used by autism experts would also enable a much larger 
group of people with relevant experience (e.g., parents, speech and language therapists, 
public health workers) to contribute important observations of young children with ASD. 
Finally, repeated intensive monitoring of young children with ASD could help determine 
whether early interventions found to improve outcomes (e.g., National Research 
Council, 2001) are responsible for changing developmental trajectories, or whether gains 
are related to the natural unfolding of developmental processes. 


Issues with Passive Telemetrics and Future Directions 


While passive telemetrics clearly confer significant advantages for measuring and inter- 
vening on a wide variety of behavior in natural settings, one must acknowledge special 
issues relating to privacy and confidentiality, practical considerations, and statistical and 
measurement challenges when using this technology. This section briefly reviews these 
issues and highlights areas that require future research. 


Privacy and Confidentiality 


While passive telemetrics can make it easier to sense, understand, and react to phenom- 
ena in the physical world, these technologies carry with them potential dangers that must 
be appropriately anticipated. For instance, assurances must be made that data will not 
be gathered to observe others without any controlling authority, or be used to jeopardize 
services to which a research participant is otherwise entitled (e.g., insurance coverage 
when a heart rate monitor shows arrhythmia). Computer scientists who manufacture and 
advocate the use of passive telemetric technologies are among the first to acknowledge 
these potential problems, and are taking steps to ensure security and privacy. Security is 
being addressed through data encryption programs and system designs with clear indica- 
tors that recording is occurring, allowing those being sensed or recorded to determine 
when and where data are gathered and how they are distributed (Iachello & Abowd, 
2005). 

Social and behavior scientists who use this technology must also play an active role 
in ensuring that research participants understand what data are being collected, what 
might be inferred from the data, and how one can stop data collection at desired times. 
Such investigator responsibilities can be revisited in the American Psychological Associa- 
tion’s Ethical Principles of Psychologists and Code of Conduct (2002). Discussions of the 
limits of confidentiality, including the foreseeable use of information generated through 
psychological activities, should occur at the outset of a study and thereafter as new cir- 
cumstances warrant. Social and behavioral scientists should also inform participants of 
the risks to privacy and limits of confidentiality, and before recording voices, images, 
or other types of data telemetrically, obtain explicit consent from all such persons or 
their legal representatives. To minimize intrusions on privacy, social and behavioral sci- 
entists need to include in written and oral reports only information that is germane to 
the purposes for which the recordings are made, and to discuss confidential information 
obtained in their work only for appropriate scientific or professional purposes. Social and 
behavioral scientists must also be careful not to disclose information that could reason- 
ably lead to the identification of a research participant, unless they have obtained prior 
consent from the person observed. As passive telemetric technologies become more per- 
vasive, investigators and research participants will have to compare the perceived benefits 
and costs of the uses of telemetrics. In the interim, attention to issues of security, visibil- 
ity, control, and privacy should help to ensure a more positive use of these technologies. 


Practical Considerations When Using Passive Telemetrics 
to Assess Behavior 


In addition to privacy and confidentiality concerns, a number of practical issues need to 
be considered when using passive telemetrics to assess behavior. For instance, one must 
consider the financial costs involved in purchasing emerging technologies. Fortunately, as 
passive telemetric devices’ capabilities increase, their size and cost decrease. Intel predicts 
that Moore’s Law (Moore, 1965), a doubling of computational power and speed every 
24 months, will continue for at least the next 10 years. Combined with falling costs 
and increasing capabilities of memory, the functionality-to-price ratio of passive telem- 
etric devices will only continue to improve. In the next decade, mobile devices such as 
phones and PDAs will have the computational, memory, and networking capabilities of 
today’s desktop computers. Social and behavioral scientists interested in passive telemetric 
monitoring will be able to leverage these common devices, since they will have already 
been purchased by people for their communication, entertainment, and health care needs 
(Intille, Larson, & Kukla, 2002). 
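
The arithmetic behind this projection can be made explicit. The short Python sketch below simply compounds the 24-month doubling period stated above over a 10-year horizon; the function name and default argument are illustrative choices, and the result describes the stated assumption, not a measured trend.

```python
def growth_factor(years, doubling_months=24):
    """Multiplicative growth after `years` if capability doubles every `doubling_months`."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings


# Ten years at one doubling every 24 months is 5 doublings.
ten_year_factor = growth_factor(10)
```

Five doublings correspond to a 32-fold increase, which is the scale of change implied by the comparison between future mobile devices and present desktop computers.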

Investigators interested in using passive telemetrics must also be able to identify and 
learn how to use this technology in a manner that meets acceptable standards for practice. 
At the time of this writing, peer-reviewed journals devoted specifically to technological 
developments and applications in the social and behavioral sciences include Behavioral 
Assessment, Behavior Research Methods (formerly Behavior Research Methods, 
Instruments, and Computers), Social Science Computer Review, IEEE Transactions on 
Information Technology in Biomedicine, and Computers in Human Behavior. Annual 
conferences highlighting these technologies, and demonstrating their capabilities through 
exhibition displays, include Measuring Behavior, the International Conference on Meth- 
ods and Techniques in Behavior Research; the Association for Computing Machinery 
(ACM) special interest group on Computer-Human Interaction (SIGCHI); the Institute 
of Electrical and Electronics Engineers (IEEE) International Symposium on Wearable 
Computers (ISWC); Ubicomp; Persuasive; and Body Sensor Networks. 

After identifying a useful system for passive telemetric recording and before initiat- 
ing a study, researchers should get better acquainted with their equipment by using the 
technology themselves for at least several days prior to actual data collection. This allows 
investigators to experience firsthand unforeseeable difficulties a participant may encoun- 
ter and to make necessary adjustments to the assessment procedure. Interacting with the 
technology can also help researchers better train their participants and promote compli- 
ance while under observation. 


Statistical and Measurement Issues 


Passive telemetrics capable of making assessments repeatedly in multiple environments 
will produce intensive longitudinal data (Walls & Schafer, 2006) and a new class of mul- 
tivariate, multiple-subject databases that include many more time points than are typi- 
cally found in the social and behavioral sciences (e.g., 200 or more recorded occasions 
per person). Such intensive data will undoubtedly provide more power to ask and answer 
a range of questions relating to a given behavior’s temporal features and context depen- 
dency. However, along with these opportunities come special statistical and measurement 
challenges that must be effectively dealt with. Key issues include the consideration of time 
as a variable, the anticipation and handling of measurement errors in data streams, the 
selection of appropriate statistical techniques for modeling intensive longitudinal data 
both within and across persons, and the potential for behavioral reactivity to passive 
telemetric measurement. 

Given the sampling intensity of passive telemetrics, establishing the appropriate time 
scale for a study (e.g., hourly, daily, weekly), sampling rate (i.e., spacing of measure- 
ment), and inclusion of time-varying and time-invariant covariates for detecting states or 
changes in a given behavior will need to be determined prior to large-scale implementa- 
tion (for a more extended review of these issues, see Walls, Höppner, & Goodwin, 2007). 
Variations in response times within and between subjects, and the presence of missing 
data, must also be accurately modeled to produce meaningful results. In most instances, 
passive telemetrics record intensively, thus generating very dense data streams that reduce 
the chances of irregular or missing data. However, longer gaps of missing data may be 
present if signals fall below a minimum detection level, or are not received at all. These 
errors may arise as a result of user errors, calibration problems, sensor misplacement, 
and/or wireless data transfer (Black, Harel, & Matthews, Chapter 19, this volume; 
Nusser et al., 2006). Since many of these data characteristics may not be known a priori, 
more extended exploratory data collection procedures and changing study designs may 
be needed before moving on to confirmatory data analyses. 
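
As one example of the exploratory checks this implies, the following Python sketch scans a stream of timestamps for gaps longer than expected. The function name, the tolerance factor, and the sample data are hypothetical; appropriate values depend on the device and the study design.

```python
def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end, missing_count) for each gap in a sorted timestamp list.

    A gap is any spacing longer than tolerance * expected_interval.
    The tolerance of 1.5 is an arbitrary illustrative default.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        spacing = curr - prev
        if spacing > tolerance * expected_interval:
            # Number of samples that should have arrived inside the gap.
            missing = round(spacing / expected_interval) - 1
            gaps.append((prev, curr, missing))
    return gaps


# Hypothetical stream sampled once per second with two dropouts.
stamps = [0, 1, 2, 3, 7, 8, 9, 15, 16]
gaps = find_gaps(stamps, expected_interval=1)
```

Here the two dropouts are reported as gaps from second 3 to 7 (three samples missing) and from second 9 to 15 (five samples missing). A summary of this kind can help an investigator decide whether gaps are rare enough to ignore or need to be modeled.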

Repeated measurements over time on a single subject create serial dependency that 
violates the statistical assumption that errors in the data are independent across obser- 
vations. Analytical approaches that can calculate an autocorrelation between adjacent 
observations will be needed to transform serially dependent data into independent data 
before any further analyses can be carried out. A growing number of statistical methods 
are capable of handling the vast number of observations and dependency inherent in 
intensive longitudinal data, and of modeling within- and between-person observations 
simultaneously (for extended reviews, see chapters on analytic methods, Part III, this volume; 
Singer & Willett, 2003; Walls & Schafer, 2006). Some especially promising approaches 
include time series analysis; multilevel modeling; survival analysis and point process mod- 
eling; and pattern-based methods, such as cluster analysis and growth mixture models. 
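
Serial dependency can be quantified before any model is chosen. The Python sketch below computes the sample autocorrelation between observations a fixed number of steps apart, using the conventional estimator in which the lagged cross-products are divided by the total sum of squares. The function name and example series are illustrative only.

```python
from statistics import mean


def autocorrelation(series, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mu = mean(series)
    denom = sum((x - mu) ** 2 for x in series)
    num = sum((series[t] - mu) * (series[t + lag] - mu) for t in range(n - lag))
    return num / denom


# A steadily increasing series is positively autocorrelated at lag 1;
# a series that flips sign at every step is negatively autocorrelated.
trending = [1, 2, 3, 4, 5, 6, 7, 8]
alternating = [1, -1, 1, -1, 1, -1, 1, -1]
r_trend = autocorrelation(trending)
r_alt = autocorrelation(alternating)
```

Values far from zero, as in both examples, indicate that adjacent observations are not independent and that methods assuming independent errors would be inappropriate without adjustment.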

Time series analysis (Glass, Wilson, & Gottman, 1975; Velicer & Fava, 2003) can be 
used to uncover the autoregressive structure of data, examine temporal regularities (i.e., 
periodicity) or response variations over time, and assess the effects of either a planned 
or unplanned intervention. Techniques capable of visually displaying, quantifying, and 
analyzing two or more synchronized time series either at the variable (e.g., heart rate and 
motor movement) or person level (e.g., mother and child interaction) are also emerging 
(Höppner, Goodwin, & Velicer, 2008; Lamey, Hollenstein, Lewis, & Granic, 2004). 
Multilevel models or hierarchical linear models that conceptualize interindividual hetero- 
geneity in intraindividual change as variability around a common trajectory are useful if 
one wishes to simultaneously examine and explain heterogeneity in datasets containing 
both within- and across-person observations. Survival analysis (Kalbfleisch & Prentice, 
2002) and point process modeling (Blossfeld & Rohwer, 2002) can be used to character- 
ize the distribution of time-to-event data, to determine the likelihood of an event occur- 
ring, to test for differences between groups of individuals, and to model the influences 
of covariates on duration data. Last, if the existence of subgroups is hypothesized but 
not known a priori, pattern-based methods, such as cluster analysis and growth mixture 
models, may be used to classify individuals into more homogeneous groups based on 
similar or dissimilar trajectories (for a review of common pattern-based methods, see Bauer 
& Curran, 2003). 
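
To make the idea of characterizing time-to-event data concrete, the following Python sketch implements the standard Kaplan-Meier product-limit estimator, one common survival analysis tool. It is offered as an illustration, not as the specific method of the works cited above; the variable names and the example durations are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each time with at least one event.

    durations: time until the event, or until observation stopped.
    observed: True if the event occurred, False if the record was censored.
    """
    curve = []
    survival = 1.0
    event_times = sorted(set(d for d, e in zip(durations, observed) if e))
    for t in event_times:
        # Everyone whose duration is at least t is still at risk at time t.
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical minutes until the next episode of a behavior;
# False means that recording ended before an episode was seen.
durations = [2, 3, 3, 5, 8]
observed = [True, True, False, True, False]
curve = kaplan_meier(durations, observed)
```

In this example the estimated probability of remaining episode-free falls to 0.8 at 2 minutes, 0.6 at 3 minutes, and 0.3 at 5 minutes. Censored records contribute to the number at risk until they drop out, which is how the estimator uses incomplete observations.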

Finally, although designers of passive telemetric technologies are working to decrease 
user burden and annoyance (e.g., Hudson et al., 2003) and construct device forms that are 
discreet and comfortable enough to wear or carry for long periods of time (e.g., Knight, 
Baber, Schwirtz, & Bristow, 2002), their potential for causing measurement reactivity is 
currently unknown. In many cases, measurement reactivity will be difficult to assess until 
participants have engaged in data collection for an extended period of time. Social and 
behavioral scientists should be aware of this potential artifact and take proactive steps to 
identify and minimize it. Some suggested ways to assess reactivity to passive telemetric 
measurement include piloting the data collection devices with participants before 
initiating a study, interviewing participants about their experience with the technology during 
and after a study, and conducting preliminary analyses to look for systematic changes in 
obtained data (e.g., diminished within-person variation over time). 
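
A preliminary analysis of the last kind can be very simple. The Python sketch below splits one participant's readings into consecutive blocks and compares the standard deviation of each block; the block size, the data, and the decision rule are hypothetical, and a steadily shrinking spread is only a prompt for closer inspection, not proof of reactivity.

```python
from statistics import pstdev


def blockwise_sd(values, block_size):
    """Standard deviation of each consecutive, non-overlapping block of readings."""
    return [pstdev(values[i:i + block_size])
            for i in range(0, len(values) - block_size + 1, block_size)]


# Hypothetical readings from one participant, earliest first.
readings = [10, 14, 6, 12, 9, 11, 10, 10]
sds = blockwise_sd(readings, 4)

# True when each block is less variable than the one before it.
shrinking = all(a > b for a, b in zip(sds, sds[1:]))
```

With these illustrative data the second block is markedly less variable than the first, so the flag is set. In practice the comparison would be made over days or weeks and across participants.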


Summary and Conclusion 


This chapter has described a new class of wireless measurement technology called pas- 
sive telemetrics that can facilitate repeated assessments, potentially reduce behavioral 
reactivity to measurement, and enhance ecological validity in social and behavioral sci- 
ence research. Passive telemetric technology can enable social and behavioral scientists to 
measure behavioral, affective, and physiological processes in real time in natural settings, 
making feasible intensive longitudinal studies that appreciate contextual and temporal 
patterns of human behavior. 

While passive telemetrics may be a useful addition to social and behavioral scientists’ 
measurement armamentaria, a scientific approach to behavioral assessment permits a 
diverse set of methods. Passive telemetrics should be used only if they can sample behav- 
ior in a way that best suits the research question and aims of a study. Like other methods 
and measures for assessing behavior, passive telemetrics has both strengths and weak- 
nesses; it is not a panacea. For instance, wearable computers that place sensors on the 
body can be problematic if people fail to maintain and wear the equipment correctly, or 
have a difficult time habituating to the sensors. Ubicomp sensing is potentially limited by 
its inability to gather some types of data from a distance, the high cost associated with 
large-scale implementation, and the fact that the technology is useful only where the sens- 
ing infrastructure exists. 

The best approach to sampling behavior is to collect data using multiple methods 
(e.g., physiological, self-report, direct observation) and assess convergence between them. 
Such multimethod assessments use alternative measures that differ in method factors but 
assess some facet of the target behavior. For instance, an integrative assessment protocol 
could include wearable sensors to record physiological data; a mobile phone or PDA to 
record subjective information, such as feelings and opinions; and environmental sensors 
to obtain data on behavior and context (e.g., location, proximity, activity). Naturally, 
the configuration of such a system would be dictated by an investigator’s chosen experi- 
mental design, dependent and independent variables, population studied, and mode of 
treatment. 

Finally, much of the literature on passive telemetrics consists of “proof of concept” 
demonstrations designed to show what can be done with this technology. To date, many 
of these applications lack empirical evaluation. The quality of passive telemetric applica- 
tions could be strengthened by collaborative efforts between developers of this technol- 
ogy and social and behavioral scientists during the planning stages. Social and behavioral 
scientists are encouraged to become involved by providing sites where prototypes of the 
technology can be field-tested and data on reliability and validity gathered. Such col- 
laborative arrangements would facilitate the evaluation of passive telemetrics, help to 
determine its functionality and utility, and provide a method for increasing social and 
behavioral scientists’ involvement in shaping this technology. 


Note 


1. Radio frequency identification (RFID) is a generic term for technologies that use radio waves to iden- 
tify objects automatically. Identification is most commonly achieved by storing a unique serial num- 
ber that identifies an object on a microchip with an embedded antenna. The antenna enables the chip 
to transmit the identification information to a reader. The reader converts the radio waves reflected 
back from the RFID tag into digital information that can then be transmitted to a computer. 


CHAPTER 15 


Emerging Technology 
for Studying Daily Life 


STEPHEN S. INTILLE 


This chapter explores emerging developments in the use of technology for studying 
daily life that are not discussed elsewhere in this volume. The rapid adoption of 
sensor-enabled mobile technologies, especially phones, will create new opportunities for 
researchers interested in measurement of behavior. Innovations in technology will also 
make possible novel forms of real-time, computer-driven interventions, in which the com- 
puter presents information at automatically determined and behaviorally based “teach- 
able moments.” 

Prior chapters in this handbook discuss the experience sampling method (ESM; 
Csikszentmihalyi, 1982) and ecological momentary assessment (EMA; Stone & Shiff- 
man, 1994), which are techniques whereby a subject records quantitative or qualitative 
observations about his or her context, often triggered by an electronic timing device. 
These techniques generate longitudinal datasets that often include time series in which 
specific measures of interest are sampled throughout a subject’s everyday life experiences. 
A researcher chooses one of three sampling methods: (1) systematic sampling on a fixed- 
interval schedule, such as every 4 hours; (2) stratified sampling on a random-interval 
schedule, such as, on average, once every 4 hours or sometime randomly within every 
2-hour window; or (3) purposive sampling in response to user initiative, in which the user 
is told to make a data entry whenever he or she performs a particular activity (Conner 
& Lehman, Chapter 5, this volume). Prior chapters also discuss ambulatory monitoring, 
in which devices are worn on the body. Ambulatory monitors can continuously measure 
physiological parameters such as heart rate or heart rate variability and skin conductance 
(F. Wilhelm, Grossman, & Müller, Chapter 12, this volume); body states such as postures 
or body movements (Bussmann & Ebner-Priemer, Chapter 13, this volume); or general 
context, such as location using a global positioning system (GPS), or ambient sounds 
using a recording device (Goodwin, Chapter 14, this volume; Mehl & Robbins, Chapter 
10, this volume). 
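
The first two sampling schemes described above can be expressed as prompt schedules. The Python sketch below generates a fixed-interval schedule and a schedule with one random prompt inside each window; purposive sampling has no schedule because the participant initiates each entry. Function names, the waking-day boundaries, and the seed are hypothetical.

```python
import random


def fixed_interval(start, end, interval):
    """Systematic sampling: a prompt every `interval` minutes after `start`."""
    return list(range(start + interval, end + 1, interval))


def random_within_windows(start, end, window, rng):
    """Stratified sampling: one prompt at a random minute inside each window."""
    return [rng.randrange(w, min(w + window, end))
            for w in range(start, end, window)]


rng = random.Random(0)  # seeded so that the schedule is reproducible
day_start, day_end = 8 * 60, 20 * 60  # 08:00 to 20:00, in minutes after midnight

systematic = fixed_interval(day_start, day_end, 240)               # every 4 hours
stratified = random_within_windows(day_start, day_end, 120, rng)   # 2-hour windows
```

The systematic schedule contains prompts at 12:00, 16:00, and 20:00, whereas the stratified schedule contains six prompts, one at an unpredictable minute within each 2-hour window.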

This chapter highlights measurement and intervention opportunities that will emerge 
as sensor-enabled mobile devices become more computationally powerful and commonplace. 
These devices make possible a new fusion of EMA and ambulatory monitoring, in 
which triggering of self-report, data collection, or feedback related to encouraging 
participant compliance is based on automatic analysis of the ambulatory monitoring data in 
real time by the mobile device. Researchers will no longer be limited to systematic, strati- 
fied, or purposive EMA sampling methods. Context-sensitive ecological momentary 
assessment (CS-EMA; Intille, 2007; Intille et al., 2003) creates a fourth option: purposive 
sampling in response to automatically detected user behavior, context, or physiological 
response, as measured passively or semiautomatically using sensors. User behavior might 
consist of information about physical activity, keywords spoken, travel patterns, or even 
use of the mobile phone or applications on the phone. Context might consist of infor- 
mation such as location, proximity to types of places, proximity to other people, light 
exposure, or even current weather. And physiological response might consist of changes 
in heart rate, skin conductance, or other measurable body states. The sampling itself 
may consist of questions presented on the mobile device, as in standard electronic sam- 
pling (Conner, Feldman Barrett, Bliss-Moreau, Lebo, & Kaschub, 2003), but where the 
researcher has the option to have the questions tailored to the detected situation. In addi- 
tion, the sampling may trigger a change in passive data collection fidelity, perhaps causing 
a short audio sample to be recorded, as used in naturalistic observation sampling (Mehl 
& Robbins, Chapter 10, this volume) or turning on a location-finding system just long 
enough to get a reading, but not so long as to drain the phone’s battery unnecessarily. 
This automatic detection of behavior, context, and physiological change might also 
be used for context-sensitive, ecological momentary intervention (EMI; Patrick, Intille, 
& Zabinski, 2005). An EMI tailors interventions in real time, providing personalized 
feedback at the exact moment, or shortly after, a targeted behavior, context, or physi- 
ological state is identified. The mobile device might provide information “just in time,” at 
what might be called “teachable moments” (Intille, 2004), and an increasing number of 
such systems are being envisioned, prototyped, and evaluated (Heron & Smyth, 2010). 


Opportunity 


In the simplest use of CS-EMA, questions can be triggered using explicit, manually coded 
mappings between questions and sensor readings. For example, a question can be trig- 
gered each time a heart rate threshold is exceeded, as detected by a wireless heart rate 
monitor, or each time the mobile device’s location-finding system indicates that a person 
is in a particular place. As more ambulatory and context information is available from 
sensors, however, these systems are likely to evolve until they use the computing device 
(i.e., the mobile phone) to preprocess the data in real time to infer automatically a user 
state, using a pattern recognition algorithm. The data can trigger a question that is based 
on the likelihood that a subject has engaged in a specific behavior, and this inference may 
be based on combining data from several different sensors. 
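
The simplest case, an explicit mapping from a sensor reading to a question, can be sketched in a few lines of Python. The threshold, the data format, and in particular the cooldown period are assumptions added for illustration; the cooldown is not part of the description above and is included only to show how repeated prompts during a sustained elevation might be limited.

```python
def trigger_times(readings, threshold, cooldown):
    """Return the times at which a question would be triggered.

    readings: list of (time_in_seconds, heart_rate) pairs in time order.
    A prompt fires when heart rate rises above the threshold, provided that
    at least `cooldown` seconds have passed since the previous prompt.
    """
    prompts = []
    last_prompt = None
    above = False
    for t, hr in readings:
        crossed = hr > threshold and not above  # upward crossing only
        above = hr > threshold
        if crossed and (last_prompt is None or t - last_prompt >= cooldown):
            prompts.append(t)
            last_prompt = t
    return prompts


# Hypothetical heart rate samples (seconds, beats per minute).
readings = [(0, 70), (60, 125), (120, 130), (180, 90),
            (240, 128), (900, 80), (960, 140)]
prompts = trigger_times(readings, threshold=120, cooldown=600)
```

In this example the crossings at 60 and 960 seconds trigger questions, while the crossing at 240 seconds is suppressed because it falls within the cooldown. A system that infers user state with a pattern recognition algorithm would replace the threshold test with the classifier's output.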

The opportunity created by CS-EMA and EMI is that information acquisition or 
presentation is timed with relevant behavior. Sampling may occur only during or just 
after the activities of research interest. Intervention content may be delivered only when 
it is likely to be useful to a person, based on that person’s measured behavior, physiologi- 
cal state, or contextual situation. This context-sensitive timing of information delivery 
or requests for self-report may minimize interruption and therefore annoyance without 
compromising the quality of data acquired on behavior or the situation being studied. 

Sensor-triggered self-reports do not necessarily need to depend on sensors worn by 
the subject. Contextual information may be obtained using the phone’s data network and 
location-finding system. For instance, a phone might continuously monitor the weather 
and ask questions only on sunny days. Or, a phone might infer, based on location infor- 
mation and movement speed, that someone is in a vehicle, and ask a question only if 
traffic-monitoring systems report congestion at that particular location. Furthermore, 
wireless networks make it possible for mobile computing devices to share information 
in studies with multiple participants in the same area. For example, using a Bluetooth 
network, two computing devices can detect when they are within a few meters of one 
another (Eagle & Pentland, 2006) and share data from each other’s sensors. Therefore, 
self-reported readings for one participant in a study can be triggered by sensor readings 
from another participant in the study, or when another family member’s proximity to 
the phone is detected. A researcher interested in how interaction between people impacts 
activity could program the mobile computing device to ask a question only just after two 
people are likely to have been interacting, where interacting is defined as anytime they 
move within some predetermined distance from one another. 
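
The last example, asking a question just after two people have been near one another, can be sketched as follows in Python. Positions are given as hypothetical (x, y) coordinates in meters sampled at the same moments for both people; a deployed system would more likely rely on a proximity signal such as Bluetooth detection than on coordinates, and the 3-meter distance is an arbitrary illustrative value.

```python
import math


def prompts_after_interaction(track_a, track_b, max_distance):
    """Return the sample indices at which an interaction has just ended.

    track_a, track_b: equal-length lists of (x, y) positions in meters,
    recorded at the same sample times for two participants.
    An interaction is any run of samples within `max_distance` of each other.
    """
    prompts = []
    together = False
    for i, (a, b) in enumerate(zip(track_a, track_b)):
        near = math.dist(a, b) <= max_distance
        if together and not near:
            prompts.append(i)  # first sample after the two people separate
        together = near
    return prompts


# Person A stays put; person B approaches and leaves twice.
a = [(0, 0)] * 6
b = [(10, 0), (2, 0), (1, 0), (8, 0), (3, 0), (9, 0)]
prompts = prompts_after_interaction(a, b, max_distance=3)
```

Questions would be asked at samples 3 and 5, the first samples after each period of proximity ends. Defining interaction purely by distance is a simplification, as the text notes, and it will include occasions when two people are near one another without interacting.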

These are just a few examples. As mobile phone adoption increases, increasingly 
sophisticated sensors will be available: built-in accelerometers, magnetometers, gyro- 
scopes, cameras, microphones, GPS and location-finding systems, and light sensors. With 
continuous wireless data connections, new phones can use these sensors in combination 
with databases on the Internet to infer more information about behavior and provide 
additional self-report or trigger data-gathering options. For instance, location might be 
used in combination with a mapping service to determine, in real time, whether some- 
one is in a particular type of neighborhood or near a specific business. In addition, new 
mobile devices have low-power wireless communication capabilities that permit other 
devices worn on the body or placed in the environment to send data to the phones in 
real time. New personal wireless network protocols, such as the Bluetooth Low Energy 
Technology protocol (Bluetooth SIG, 2010), will support continuous use of devices such 
as heart rate monitors (Alive Technologies Products, 2007), activity monitors, and even 
skin conductance monitors (Strauss et al., 2005). Such sensors might provide additional 
information about body movement, use of objects in the environment, or even air quality, 
allowing more detailed inference about a person’s physiological state and context. Phones 
are already being used in place of credit cards, and future researchers may be able to use 
phone payment systems to gather additional information about where people are, what 
they are interested in, and what they are purchasing and consuming. 

Today, using electronic EMA effectively can be significantly more expensive and 
logistically challenging than alternative methods, such as paper-based instruments. Use 
typically requires the research team to obtain and distribute digital devices, hire a pro- 
grammer or company to develop and test the protocol, conduct a usability test to ensure 
devices are working properly, perform data cleaning, and carefully oversee participants 
during the study to manage data quality and minimize missing data (Conner & Lehman, 
Chapter 5, this volume). Activity detection based on sensors is still primitive compared to 
what will be possible in 10 years, when phone sensing capabilities are improved and more 
research has been completed showing how to use CS-EMA in practice. 

Today, the use of CS-EMA or EMI may add additional layers of complexity over 
electronic EMA, but mobile phone adoption trends indicate that an increasing number of 
potential subjects will already own sophisticated, sensor-enabled mobile devices before 
they are approached to be in a research study. These potential subjects will be accustomed
to carrying these devices nearly everywhere and keeping them charged, unlike
devices handed out in most current research studies using electronic EMA. The built-in 
Internet access on new phones will permit remote data collection and remote study com- 
pliance monitoring. By leveraging this consumer investment in technology, as well as the 
massive engineering effort being devoted by the telecom industry to the devices and their 
sensing capabilities, researchers will have access to a “free” device that can be used for 
data storage, real-time activity detection, real-time compliance monitoring and encour- 
agement, and novel forms of feedback for health interventions. As phones improve, such 
measurement tools can be improved as well. The tools possible are fundamentally dif- 
ferent from those available using standard phones or older mobile phones. For instance, 
while interactive voice response with mobile devices (Collins, Kashdan, & Gollnisch, 
2003) provides one method to exploit standard phones for research and reach partici- 
pants in the field as they go about their lives, this method does not permit CS-EMA that 
exploits the mobile device’s unique capabilities to use sensors to infer information about 
behavior, context, and physiological state of the participant. 

The remainder of this chapter, therefore, looks toward this future, where compu- 
tationally powerful and sensor-loaded mobile computing devices that are ubiquitous 
throughout the population may be used to detect automatically a person’s context and 
behavior, then use that information to trigger self-report, or intervention content, or tai- 
lor how information is presented or collected. 


Recording Behavior: Life Logging Systems 


Numerous prototypes of new behavioral monitoring devices are created each year and 
published in engineering conferences and journals. Recent examples include a device that 
detects transportation method (Zheng, Li, Chen, Xie, & Ma, 2008); a device that detects 
swallowing and other dietary behaviors (Amft & Troster, 2008); a system that detects 
behaviors in the home, such as making tea (Philipose et al., 2004); software for mobile 
phones that can infer information about foods eaten (Zhu et al., 2010); a device that 
detects physical activity energy expenditure (Albinali, Intille, Haskell, & Rosenberger, 
2010); software that detects indoor location (Bahl & Padmanabhan, 2000); a device 
that records everything someone sees and clusters images into similar places and detects 
context (Clarkson & Pentland, 1999); a device that detects keywords spoken for audio 
keyword spotting (Harada et al., 2008); and technology for storing information about 
everything that someone does or sees (Gemmell, Bell, Lueder, Drucker, & Wong, 2002). 

Commercial personal monitors are also being introduced rapidly. Examples include 
monitors for measuring heart rate (e.g., Polar heart rate monitors), physical activity (e.g., 
Philips DirectLife monitor), sleep (e.g., Zeo’s Personal Sleep Coach), and UV exposure 
(e.g., Oregon Scientific’s Personal UV Monitor). Online tools are being created to help 
people to track their own behavior (e.g., time use and affective states, such as happiness; 
Supernifty’s Track It software) and food consumption (e.g., FitNow’s Lose It! software). 
Today, most commercial monitors are used only for short periods of time, such as during 
exercise, but some “life loggers” are using them for continuous self-tracking of health 
and mood (Wolf, 2009). Researchers have created devices that permit continuously or 
periodically sampled life logging that uses video (Clarkson, 2002), images (Hodges et 
al., 2006), audio (Mehl, Pennebaker, Crow, Dabbs, & Price, 2001), and physical activity 
(e.g., ActiGraph GT3X). As these technologies mature and more people become inter- 
ested in long-term, personal monitoring, the devices may be worn under, or even embed- 
ded into, clothing (Van Laerhoven, Schmidt, & Gellersen, 2002). 

More engineering is needed to improve the usability and lower the deployment cost 
of these logging systems, but the viability of the basic measurement capability has been 
demonstrated. Some researchers have even argued that it will soon be possible to eas- 
ily, continuously, and economically store digitally nearly everything about one’s life— 
documents read; locations visited; images of what was seen; physiological measurements, 
such as heart rate; and audio clips of sounds heard (Gemmell et al., 2002). As mobile 
devices improve and become the technology that early-adopting life logging consumers 
use to capture information about their behavior, researchers interested in studying behav- 
ior will have new tools at their disposal. 

The research prototypes suggest that the following system could be created today: a 
comfortable but bulky device that collects continuous streams of video data documenting 
everything the subject sees; audio data of everything the subject hears and says; acceler- 
ometer data of the subject’s limb motion; data on physiological parameters, such as heart 
rate; and data on the subject’s location in the community, as well as other, miscellaneous 
data about how the subject is feeling, as reported by the subject via a user interface. 
This device might use the mobile phone and wirelessly connected sensors to create an 
experiential memory digital diary device. Such a device was envisioned as early as 1945, 
but only recently have improvements in the size and cost of technology made the vision 
achievable (Bush, 1945; Gemmell et al., 2002). Life logging systems that continuously 
measure physiological and environmental parameters may provide new insights into the 
behavior of research subjects. They also may provide data that can be used to create CS- 
EMA and EMI systems. 


Case Study: CS-EMA and EMI 
in Physical Activity Measurement and Motivation 


My research group is developing a system intended to enable cost-effective, population- 
scale, time series measurement of type, duration, intensity, and location of physical and 
sedentary activity. Researchers studying physical and sedentary activity are interested in 
developing better tools to understand how, why, when, and where people get exercise of 
moderate or greater intensity, or engage in various sedentary behaviors, and new tools to 
motivate behavior change. The system is intended for research purposes, but life loggers 
might use it to create detailed, long-term diaries of their physical behavior. In this chapter, 
the system design is used as a case study to highlight some opportunities and challenges 
of creating CS-EMA and EMI systems, which are further discussed in the conclusion of 
this chapter. Although the focus here is on physical activity, the same opportunities and 
challenges would apply to systems measuring other aspects of behavior, physiological 
changes, or context. 

Figure 15.1 shows an overview of the envisioned Wockets System, so named because 
it uses custom-designed, miniature, wearable motion sensors called Wockets (Intille, 
Albinali, Mota, & Haskell, 2010). Important components and assumptions, numbered to 
correspond to Figure 15.1, are described below. 

FIGURE 15.1. Overview of the Wockets continuous physical activity monitoring system. The 
Wockets System uses a standard mobile phone and a set of miniature wearable accelerometers that 
send raw data about motion to the phone. Each numbered component of the diagram is described 
in the text. 


The Mobile Device (1) 


The Wockets System is designed to exploit common phones that subjects are already 
using. As of June 2008, 84% of the U.S. population was estimated to have mobile phones 
(CTIA: the Wireless Association, 2008; SNL Kagan, 2008), and an estimated 79% of all 
teens (ages 13-19) had a mobile device, a 36% increase since only 2005. Phones are even 
being introduced for preteens (Howe, 2005), and some countries have far more mobile 
phone subscriptions than people, suggesting a nearly 100% penetration rate for adults 
(International Telecommunication Union, 2010). Smartphones are expected to dominate 
the market soon. In the United States, smartphone adoption rose from 7% in 2007 to 17% 
in 2009 and is predicted to continue rising dramatically (Forrester Research, 2010). 

Opportunity. Just as many studies today take for granted that study participants 
will have access to a telephone, within 10 years it will be reasonable to assume that 
participants will have access to an advanced, sensor-enabled mobile computing device. 
Leveraging mobile phone devices that research participants already own may lower the 
equipment costs of some studies, because researchers will load software onto those exist- 
ing devices. One of the sensors in the phone is likely to be an accelerometer. 

Challenges. The mobile phone market is fragmented, requiring systems to be devel- 
oped on multiple operating systems. There will always be some subjects without appro- 
priate devices, and study designs must accommodate this. Layering a study protocol onto 
a phone used for other purposes will reduce researcher control over the device, and some 
manufacturers currently limit how certain phone functions are used, making some types 
of CS-EMA software infeasible. Finally, running computationally and sensor-intensive 
applications continuously could affect phone usability if applications are not carefully 
designed. 


App Stores (2) 


Participants with suitable phones may be identified and recruited via app stores, which 
are Web services that phone users browse to identify and download software to their 
smartphones. Researchers have just begun to explore the use of app stores for research 
(Miluzzo, Lane, Lu, & Campbell, 2010; Morrison, Reeves, McMillan, & Chalmers, 
2010). 

Opportunity. The app store model may make it possible to recruit very large num- 
bers of individuals quickly into research studies. 

Challenges. Practical obstacles for some studies may be obtaining informed consent 
on a mobile device, creating an easy mechanism to compensate subjects, and avoiding or 
adjusting analyses for subject-selection biases. 


Context-Sensitive EMA (3) 


The advanced capabilities of the phones that participants own will enable CS-EMA stud- 
ies that take advantage of each participant’s desire to carry and keep charged the mobile 
phone. Open-source tools for implementing electronic EMA, such as the MyExperience 
tool (Froehlich, 2010; Froehlich, Chen, Consolvo, Harrison, & Landay, 2007) and Open 
Data Kit (ODK; 2010), are lowering barriers to implementing standard EMA and CS- 
EMA on smartphones. 

Opportunity. The use of personal phones should reduce missing data, because sub- 
jects will not be required to remember to carry a second device, and make longer studies 
more viable because no special equipment must be distributed. 

Challenges. Existing software is still difficult for researchers without programming 
experience to use, and CS-EMA adds additional complexity. 


Remote Data Collection (4) 


Once someone is running the software, data from the electronic surveys are sent to a secure 
remote server in real time for review and cleaning, exploiting the phone’s data 
network. 

Opportunity. This real-time, remote review will allow researchers quickly to spot 
problems such as missing data or lack of participant compliance with study protocols. 

Challenges. Studies that require sending high-bandwidth, raw data streams of many 
megabytes, such as days of accelerometer data, pictures, or audio, may additionally need 
to compensate participants for the cost of data transfer. 


Remote Study Administration (5) 


Researchers will have the ability remotely to update software protocols on the phones, 
using the phone data network. These protocols can exploit the sensing system. For exam- 
ple, a researcher can have a self-report question triggered only when a participant is either 
currently exercising (as detected by the system) or was exercising sometime in the past 2 
hours. If the GPS sensor is also used and an appropriate GIS database is available, a question 
could be triggered only if the system determines that the participant may have just 
spent time in or near a restaurant. 
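
The exercise-based trigger rule in this example can be written as a short predicate. The sketch below assumes the activity detector reports exercise bouts as hypothetical `(start_s, end_s)` pairs, with `end_s` set to `None` while a bout is still in progress; the function name and data layout are illustrative and are not part of the Wockets software.

```python
def should_prompt(now_s, exercise_bouts, lookback_s=2 * 3600):
    """Decide whether to trigger the self-report question at time now_s.

    exercise_bouts: list of (start_s, end_s) tuples for detected exercise;
    end_s is None for a bout that has not yet ended (assumed convention).
    Returns True if the participant is exercising now or finished a bout
    within the past lookback_s seconds (2 hours by default, as in the text).
    """
    for start_s, end_s in exercise_bouts:
        if start_s > now_s:
            continue  # bout lies in the future relative to now_s
        if end_s is None or end_s >= now_s:
            return True  # currently exercising
        if now_s - end_s <= lookback_s:
            return True  # exercised within the lookback window
    return False
```

For instance, `should_prompt(1500, [(1000, 2000)])` is true because the bout is in progress, while a call made more than 2 hours after the bout ended returns false.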

Opportunity. Real-time updating of protocols with participants in the field may per- 
mit development of high-quality protocols and allow researchers to respond to early data 
or pilot data, without requiring a costly face-to-face visit with participants. 

Challenges. Procedures must be developed for remote informed consent, and for 
participant education and training. 


Multiple Motion Sensors (6) 


Although it is possible to use a single phone accelerometer to gather information about 
physical activity (Saponas, Lester, Froehlich, Fogarty, & Landay, 2008), our pilot stud- 
ies suggest that adding additional accelerometers to capture both upper and lower body 
motion is likely to improve performance on a diverse set of lifestyle activities (Bao & 
Intille, 2004; Munguia Tapia et al., 2007). With the Wockets System, once a participant 
running the software consents, he or she may be mailed the low-cost Wocket accelerom- 
eter devices. These motion sensors, which can be worn under the clothing for 24-hour 
periods at a time, send raw accelerometer data to the software running on the mobile 
phone. 

Opportunity. As phones become more sophisticated, sensors that supplement a 
phone’s built-in sensors, such as the Wockets, can be designed to exploit the phone’s processing 
and data storage capabilities, thereby minimizing the size and cost of the devices, 
and maximizing wearability and permitting large-scale distribution. Participants can be 
asked to wear the small sensors all day, even during sleep. 

Challenges. Asking participants to carry or wear any extra devices creates addi- 
tional complexity and cost in study design, and will impact overall compliance; although 
remote monitoring by study personnel will help with compliance, staff time is expensive. 
Furthermore, providing financial incentives will not scale well, so the phone software 
needs to reinforce the desired use of the sensors automatically. 


Flexibility (7) 


Researchers typically ask subjects to wear ambulatory monitoring systems in one pre- 
ferred location on the body. For instance, motion monitors are often worn on the hip. 
In contrast, the Wockets System is designed for flexibility. Potential study participants 
expressed a strong desire to be able to wear the sensors flexibly if they were to wear them 
for long periods of time. For instance, a participant may be willing to wear a sensor on 
the wrist under a shirt at an office environment but be unwilling to do so when going 
out to a party. The Wockets System empowers the participant with several options. Each 
morning when the participant wakes up, a decision is made about how to wear the sen- 
sors for a particular day. If someone normally wears the Wocket band on the wrist but is 
going to a party and does not wish to do so, he or she can wear the Wocket on the upper 
arm. Some data are still obtained without undue burden on the participant. The partici- 
pant uses the phone to indicate the sensor positions, and the phone uses data from the 
sensors to confirm that this placement is as expected. If not, the phone provides real-time 
feedback to help the participant use the monitors correctly. 

Opportunity. For some classes of ambulatory monitors, wearable flexibility may be 
viable using pattern recognition algorithms that accommodate different location-sensor 
configurations. This flexibility is likely to improve compliance, so that certain everyday 
events do not result in missing data. 

Challenges. Some configurations of sensors do not perform as well as others, which 
must be taken into account in data analysis. Furthermore, to create the opportunity 
for this customization requires interaction from the participant; in the Wockets System, 
examples of specific target activities for each set of configurations must be provided. 


Easy Continuous Use (8) 


Sensors are worn for 24 hours and swapped with a second set that was charging the 
previous day. 

Opportunity. The system exploits the participant’s desire to keep his or her phone 
charged on a daily basis. 

Challenges. Battery life limitations for wearable ambulatory monitors may create 
barriers to deployment in some cases. The Wockets System hardware and software had 
to be specially designed and optimized to achieve 24-hour performance on a single charge 
of sufficiently small batteries; some functionality that might be desirable, such as use of 
gyroscopes in addition to accelerometers, had to be excluded to meet that goal. 


Raw Data Availability (9) 


The mobile phone software processes and saves the raw motion data from the Wockets, 
which are typically 40 Hz three-axis acceleration signals. It can also save raw data from 
the phone’s internal accelerometer. The raw data can be uploaded for researcher use or 
review each night, if desired, using the phone’s wireless data network when the phone is 
plugged in and charging. 
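
A rough calculation shows why nightly upload while charging is a sensible default. The sampling rate and number of axes below come from the text; the 2 bytes per sample value is an assumption made only for illustration, and the actual storage format may differ.

```python
SAMPLE_RATE_HZ = 40            # from the text
AXES = 3                       # from the text
BYTES_PER_VALUE = 2            # assumption: 16-bit samples, uncompressed
SECONDS_PER_DAY = 24 * 60 * 60


def raw_values_per_day():
    """Number of acceleration values one sensor produces in 24 hours."""
    return SAMPLE_RATE_HZ * AXES * SECONDS_PER_DAY


def raw_bytes(num_sensors, days=1):
    """Uncompressed raw data volume under the assumptions above."""
    return raw_values_per_day() * BYTES_PER_VALUE * num_sensors * days


print(raw_values_per_day())        # 10368000 values per sensor per day
print(raw_bytes(1) / 1e6)          # 20.736 MB per sensor per day
print(raw_bytes(2, days=180) / 1e9)  # 7.46496 GB for two sensors over 180 days
```

Under these assumptions a single sensor produces roughly 20 MB per day, which is modest for one night's upload but accumulates to several gigabytes per participant over a months-long study.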

Opportunity. Increasingly, behavior measurement systems will permit the collection 
and use of raw data rather than proprietary summary measures based on the raw data. 

Challenges. New tools to make it easy to store and process immense amounts of data 
need to be further developed. 


Automatic Inference of Activity (10) 


The phone software applies supervised learning (pattern recognition) algorithms to the raw 
data to infer the participant’s activities. These algorithms work best 
when each subject provides training data of particular target activities performed while 
wearing the sensors in particular configurations. For instance, the participant might indi- 
cate on the phone that an activity, such as “walking briskly,” has just started, perform the 
activity normally for 30 seconds, then indicate the activity has stopped. The algorithm 
will then rebuild the activity models used by the recognition system, and in most cases, 
performance will improve. 

Opportunity. Unlike most monitoring systems used today (and described in the 
remainder of this handbook), future systems may take advantage of human-computer 
interaction to improve device performance, without necessarily needing a researcher 
present. Real-time information feedback to the mobile phone will make this possible. 

Challenges. Developing interfaces that make contribution of the training data sufficiently 
easy requires additional research. 


Encouraging Compliance with Real-Time Feedback (11) 


The Wockets System permits continuous monitoring of phone usage and data gathered 
from the Wockets. If the participant is not using the sensors, the phone provides real-time 
compliance feedback to remind the individual gently of how and when to wear them. 
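
One simple way a phone might decide that a sensor is not being used is to look for a run of windows with almost no motion, or with no data at all. The sketch below is an assumption-laden illustration rather than the Wockets algorithm: the threshold, the window count, and the function names are hypothetical, and a real system would need to distinguish non-wear from sleep or quiet sitting.

```python
import statistics


def window_is_still(magnitudes_g, std_threshold_g=0.01):
    """True if a window of acceleration magnitudes shows almost no motion.

    An empty window (no samples received) is also treated as "still",
    on the assumption that a sensor that sends nothing is not in use.
    """
    if not magnitudes_g:
        return True
    return statistics.pstdev(magnitudes_g) < std_threshold_g


def needs_wear_reminder(recent_windows, min_still_windows=3):
    """True if the most recent min_still_windows windows were all still."""
    if len(recent_windows) < min_still_windows:
        return False
    return all(window_is_still(w) for w in recent_windows[-min_still_windows:])
```

Requiring several consecutive still windows before reminding the participant is a deliberate trade-off: it delays feedback slightly but avoids prompting during brief pauses in movement.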

Opportunity. The phone, which the participant will be highly motivated to keep 
working, provides a real-time communication channel that the participant can exploit 
to make device management feasible, even for devices that must be charged daily and 
worn on the body in particular ways. The ability for mobile phones to detect wear time 
and respond instantly with feedback to study participants could dramatically increase 
compliance, thereby protecting statistical power by maintaining the majority of recruited 
participants. 

Challenges. Change of a variable of interest in direct response to the process of being 
measured is called reactivity. Work to date suggests that reactivity to EMA may be small 
(Hufford, Shields, Shiffman, Paty, & Balabanis, 2002), but reactivity may be more severe 
in systems that place a higher burden on the user, such as wearing ambulatory sensors or 
getting prompted with real-time feedback when the sensors are not being worn (Barta, 
Tennen, & Litt, Chapter 6, this volume). Context-triggered prompts for information, or 
prompts reminding participants to wear sensors properly if they are not, could create 
reactivity effects, because participants will be acutely aware that the system is knowl- 
edgeable about their behavior. 


Incremental Data Verification and Cleaning (12) 


The secure website where data are collected uses heuristics to flag data automatically that 
may be corrupt or missing. 
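
Two heuristics of this kind are easy to illustrate: gaps between consecutive samples that are much longer than the sampling interval, and acceleration values outside a plausible range. The sketch below is hypothetical; the thresholds, the `(time_s, ax, ay, az)` sample layout, and the function name are assumptions and do not describe the actual website's checks.

```python
def flag_problems(samples, expected_interval_s=0.025,
                  max_gap_factor=4.0, max_abs_g=4.0):
    """Flag possibly missing or corrupt accelerometer data.

    samples: list of (time_s, ax, ay, az) tuples sorted by time, with
    acceleration in g. expected_interval_s of 0.025 corresponds to 40 Hz.
    Returns a list of (kind, time_s) flags:
      ("gap", t)          -> a long silence began after the sample at t
      ("out_of_range", t) -> the sample at t has an implausible value
    """
    flags = []
    prev_t = None
    for t, ax, ay, az in samples:
        if prev_t is not None and t - prev_t > expected_interval_s * max_gap_factor:
            flags.append(("gap", prev_t))
        if any(abs(v) > max_abs_g for v in (ax, ay, az)):
            flags.append(("out_of_range", t))
        prev_t = t
    return flags
```

Flags like these would only mark data for human review; deciding whether a gap reflects noncompliance or equipment failure still requires follow-up with the participant.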

Opportunity. The researcher can then use the website to send messages to the par- 
ticipant via the phone software to resolve problems efficiently, and quick identification 
of problems will minimize the number of days when unusable data are collected due to 
participant noncompliance or equipment failure. 

Challenges. Tools to facilitate this type of remote research must be developed that 
are very simple for study staff to use. 


Design for Engagement (13) 


The Wockets System exploits the multimedia capabilities of the phone by presenting 
entertaining content, such as jokes, surprises, and accolades from the research team, as a 
form of daily positive reinforcement when quality data are being gathered. 

Opportunity. Summary information gathered by the sensor system about the par- 
ticipant’s own behaviors can be presented in a timely fashion as a reward for good com- 
pliance, perhaps reducing the amount of monetary compensation required. 

Challenges. Too much feedback may raise additional concerns about reactivity, and 
feedback that does not dynamically change throughout a long study is likely to lose 
novelty and become less effective. More research on maintaining engagement for long 
periods of time is needed. 


Longitudinal Data Collection (14) 


The real-time feedback, positive reinforcement, comfortable miniature devices, and use 
of the phone that someone will carry and keep adequately charged contribute to the fea- 
sibility of having a participant wear the system continuously for many months. 

Opportunity. Researchers may subsample from within the many months of data 
acquired from each participant, well beyond the period during which use of the system 
is novel and may impact behavior. Researchers can explore specific hypotheses, but the 
detailed datasets of type, duration, intensity, and location of physical activity, along with 
additional information acquired via self-report relating physical activity to other factors, 
can be used for data-driven discovery. 

Challenges. Dealing with these time series data will require new statistical methods. 


The Future 


Looking to the future, researchers should expect that in addition to the measurement 
devices discussed in the other chapters in this handbook, systems with the goal of dense, 
real-time, longitudinal measurement of behavior, context, physiological state, and self- 
reported state (e.g., affect) will become available. The Wockets System and other new, 
real-time longitudinal behavior monitoring technologies on the horizon will eventually 
do one or more of the following: 


• Be downloaded from an “app store,” directly onto a standard mobile phone. 

• Teach someone how to use and wear it, collecting training data in an entertaining 
  way. 

• Detect specific types of activities or contextual situations with more fidelity if one 
  or more additional wireless devices that communicate with the phone are used. 

• Provide entertaining audiovisual feedback when it is not working or being worn 
  properly. 

• Allow researchers to trigger contextually specific or behaviorally specific questions 
  based on context and behavior automatically measured or inferred from 
  phone sensors, ambulatory monitors, or self-report. 

• Create engagement by providing engaging feedback to the subject using the 
  detected information about the subject’s behavior, such as applications for health 
  monitoring, time management, or even games. 

• Transmit data about activity and compliance to researchers daily. 

• Create long-term, second-by-second activity maps for researchers, overlaid on the 
  location where the activity took place, as detected by the phone’s location-sensing 
  system. 

• Provide new opportunities to create EMIs that influence in-the-moment decision 
  making with tailored, just-in-time feedback. 


The datasets gathered by such a system—likely at significantly lower cost than using 
current tools and methods, if the appropriate software tools and training materials are 
available—will permit affordable data-driven approaches, as mentioned previously, pos- 
sibly with thousands or hundreds of thousands of people. 

Such systems, especially those that use CS-EMA, may raise additional challenges 
related to acquisition of training data, multiple-sensor data fusion, and statistical 
analysis of data. Some of the challenges are discussed in the remainder of 
this chapter. Once overcome, these challenges will make possible data-driven discovery 
on massive datasets. 


Acquisition of Training Data 


The first step in developing a supervised learning algorithm, such as the one used by the 
Wockets System, is to generate labeled data for a training algorithm. Since most user 
activity is strongly influenced by the setting, training data are most useful if generated 
from settings representative of the field study. Ideally, test volunteers are recruited from 
groups that have characteristics similar to the study’s target population, and realistic test 
activities and similar environments are used as the basis for training examples; however, 
for certain types of activity, these data may not be available (Intille, Bao, Munguia Tapia, 
& Rondoni, 2004). 

In some cases, this training step will likely be performed during the development of 
the algorithm, not by the researcher who is using pattern recognition algorithms for trig- 
gering context-based queries and data collection. In other cases, systems may be designed 
so that either researchers or subjects gather their own training data. Subject-specific 
training data collection might work as follows: The subject carries the phone and wears 
any additional sensors used in the system; the system may be initialized with data from 
a representative sample population. Alternatively or in addition, the phone may ask the 
subject to perform specific types of activities for a few seconds each. For example, the 
Wockets System works best when each user of the system provides 30-second examples of 
each of the target activities. This is similar to the custom training step used in commercial 
speech dictation software, which works without person-specific speech data but improves 
if the user reads a few pages of prespecified text. 

Once the phone has initial training data, the subject may start to use the device. Upon 
identifying a specific type of physical event, the phone can then prompt the study partici- 
pant to respond to specific questions during or just after he or she engages in specific recog- 
nized behaviors. Essentially, electronic EMA is used to acquire additional training data. 


Multiple-Sensor Data Fusion 


Access to raw sensor data from inexpensive sensors should help facilitate data fusion 
to improve behavior measurement. Some devices incorporate multiple sensors into their 
electronics. Others permit researchers to gather data from multiple sensors that are 
time synchronized. Researchers should expect improved measurement capabilities from 
devices using sensor fusion, including more detailed and accurate information about 
activity type, device use/compliance, and environmental contexts, such as location. For 
example, newer systems might combine information from GPS, motion, heart rate, infra- 
red and ultraviolet light, direction, air pressure, and even ambient sound sensors to infer 
whether someone is indoors or outdoors—information any sensor alone cannot reliably 
provide. The system might also use multiple sensors to infer location and mode of trans- 
port (Liao, Patterson, Fox, & Kautz, 2007). Low-power wireless protocols may also 
provide new sensing capabilities. For instance, by detecting proximity to mobile phones 
with Bluetooth, emerging devices may even be able to detect whether someone is likely to 
be near another person, such as a family member, providing new contextual information 
(Raento, Oulasvirta, & Eagle, 2009). 

Instrumenting a person’s environment is becoming increasingly practical. For instance, 
radio frequency identification (RFID) stickers costing less than $1 can be placed on objects 
in a home and used to detect location and type of a person’s everyday activities (Philipose, 
2005; Philipose et al., 2004). Such technology could be deployed in settings where partici- 
pants are likely to be spending time, such as their homes and offices, and be used to mea- 
sure interaction with objects such as televisions and computers. Used in combination with 
motion data, such sensing may further improve activity inference algorithms. 


Statistical Analysis 


The activity recognition systems used in CS-EMA will not be free of noise and missing 
data, which may complicate statistical analysis of the data obtained. In most situations 
the researcher using CS-EMA software will not have an easy way to determine why the 
activity recognition and context recognition algorithms work or fail, especially when 
the algorithms use multisensor data fusion or several layers of inference. For example, in 
the Wockets System, raw accelerometer data from multiple Wockets are filtered to fill in 
missing data and smooth noise. Features such as the acceleration mean, energy, entropy, 
and correlation measures are then computed on short windows of the filtered data. Other 
features may be computed as well, based on location, time, or prior behavior patterns. A 
feature vector is then fed into a decision tree classification algorithm (Quinlan, 1993) to 
detect which activity is occurring. In some systems, the classification algorithms them- 
selves may use the results of other classification algorithms as feature input. For instance, 
results from classifiers for activities such as posture or ambulation (e.g., “walking”) may 
be used as feature input into other classifiers that detect longer-term activities, such as 
“exercising,” “cooking,” or “socializing.” 
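
The feature-extraction step described above can be sketched as follows. This is a minimal illustration, not the Wockets implementation: the function names, the two-axis input, and the histogram-based entropy are all assumptions made for the example, and a real system may define energy and entropy differently (for example, in the frequency domain).

```python
import math

def window_features(x, y):
    """Summary features for one window of two accelerometer axes (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Energy: mean squared amplitude of the x-axis signal in the window.
    energy_x = sum(v * v for v in x) / n
    # Entropy: Shannon entropy of a coarse 8-bin amplitude histogram
    # (an assumed stand-in for whatever entropy definition a real system uses).
    lo, hi = min(x), max(x)
    bins = [0] * 8
    for v in x:
        k = 0 if hi == lo else min(7, int((v - lo) / (hi - lo) * 8))
        bins[k] += 1
    entropy_x = -sum((c / n) * math.log2(c / n) for c in bins if c)
    # Correlation between the two axes within the window.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    corr_xy = sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0
    return [mean_x, energy_x, entropy_x, corr_xy]

def sliding_windows(signal, size, step):
    """Split a sequence into fixed-length, possibly overlapping windows."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]
```

Each feature vector returned by `window_features` would then be passed to a classifier such as a decision tree; the classifier itself is outside the scope of this sketch.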

The analysis of data from or dependent on activity recognition must be considered 
carefully, because the algorithms will make false-positive detections of specific activities, 
and they will make false-negative errors when activities are not detected. Some algo- 
rithms will characterize their own error when making a classification. Others will not. 
The performance of the algorithms will be based entirely on the quality of the training 
datasets, which may be difficult to assess given that the participant may have collected 
the training data on his or her own, without any supervision from a researcher. The size 
and quality of that training set will have a dramatic influence on the accuracy and rel- 
evance of the misclassification rates produced. 
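
When a labeled validation sample is available, the two kinds of error named above can be tallied directly. The sketch below is illustrative only; the function name and the label strings are hypothetical, and it assumes that trustworthy reference labels exist, which the text notes may not be the case.

```python
def error_rates(true_labels, predicted_labels, activity):
    """False-positive and false-negative rates for one target activity."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if pred == activity and truth == activity:
            tp += 1          # activity correctly detected
        elif pred == activity:
            fp += 1          # detected, but it was not occurring
        elif truth == activity:
            fn += 1          # occurring, but not detected
        else:
            tn += 1          # correctly not detected
    fpr = fp / (fp + tn) if fp + tn else float("nan")
    fnr = fn / (fn + tp) if fn + tp else float("nan")
    return fpr, fnr
```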

Obtaining measurement error information may be difficult. Estimation data or 
prior estimates of measurement error properties of a sensor may not be available to the 
researcher because of lack of documentation or proprietary concerns, or the assessment 
of measurement properties may be developed in an experimental environment that is not 
well matched to the study setting. These issues are discussed in more detail elsewhere 
(Nusser, Intille, & Maitra, 2006). 


Data-Driven Discovery 


Suppose a study participant downloads future CS-EMA software onto his or her mobile 
phone and gets in the habit of carrying the phone a particular way. Perhaps additional 
sensors are also used, such as Wockets. Data are sent to the research server, and compliance 
can be monitored remotely. The researchers running the study do not need to 
recover the phone or the inexpensive Wockets, and there is no reason that data collection 
must be limited to a week, several weeks, or even several months. In some cases, it may 
be possible to run studies for years with little administrative overhead, with tools auto- 
matically processing incoming data, and researchers subsampling from within long time 
frames of data collection. 

Methodological innovations, such as remote data collection and compliance moni- 
toring using mobile phones that people already own, may make studies with thousands or 
tens of thousands of participants affordable. This makes data-driven discovery feasible, 
where algorithms mine massive datasets for unexpected trends (Glymour, 2004). Data- 
driven discovery may complement the hypothesis-driven style of research most commonly 
used today. 


Summary 


Emerging developments in sensor-enabled mobile technologies, especially phones, will 
create new opportunities for researchers interested in longitudinal measurement of behav- 
ior. Some examples have been highlighted, in particular the potential of CS-EMA, which 
fuses standard electronic EMA and ambulatory monitoring to enable automatic triggering 
of self-report based on behavior. Real-time systems that use pattern recognition, sensor 
fusion, and the sensing capabilities of phones and devices that communicate with phones 
should enable large-scale, long-term studies that generate high-fidelity datasets about spe- 
cific behaviors of study participants. These datasets might be used for data-driven dis- 
covery in addition to hypothesis-driven research, and the ability for the technologies to 
respond to behavior in real time will make possible novel interventions using EMA 

Intensive longitudinal studies are not for the faint of heart: Each one typically requires a 
lot of time, effort, and financial resources. Before commencing such a study, therefore, 
it seems worthwhile to assess its statistical power: Namely, given the proposed design 
and sample size, what is the probability of detecting a hypothesized effect if one actually 
exists? If that probability is, for example, 0.5, then it may not make sense to proceed. 
Why embark on an arduous project if it has only a 50:50 chance of success? Until recently, 
however, researchers wishing to conduct power analyses for intensive longitudinal stud- 
ies had few resources on which to draw. The situation has recently improved to the extent 
that it is now possible to give basic advice on how to carry out a power analysis using 
widely available software packages. That is the subject of our chapter. 

First, the good news: The exercise of conducting a power analysis for an intensive 
longitudinal study greatly increases understanding of those designs and of how they can 
be used to capture the phenomenon of interest. The not so good news: Conducting a 
power analysis for intensive longitudinal studies is considerably more challenging than 
for simpler designs. Because intensive longitudinal studies involve multiple sources of 
random variation, one needs to make assumptions about each source in order to do the 
required calculations. It is especially helpful, therefore, to have some prior data available 
upon which to base the assumptions. 

Researchers who want to conduct a power analysis for an intensive longitudinal 
study can choose from several options: (1) working with the power formulae available in 
books covering multilevel modeling, repeated measures designs, and longitudinal designs 
(Fitzmaurice, Laird, & Ware, 2004; Gelman & Hill, 2007; Hox, 2010; Moerbeek, Van 
Breukelen, & Berger, 2008; Snijders & Bosker, 1993, 1999); (2) using specialized soft- 
ware designed for power analyses for multilevel and longitudinal models, for example, the 
freely available PinT (Bosker, Snijders, & Guldemond, 2007), Optimal Design (Raudenbush, 
Spybrook, Liu, & Congdon, 2006), and RMASS2 (Hedeker, Gibbons, & Waternaux, 
1999); and (3) using simulation methods in general purpose programming software, 
such as Mplus 6.2 (Muthén & Muthén, 1998-2007), Statistical Analysis Software 
(SAS; Littell, Milliken, Stroup, Wolfinger, & Schabenberger, 2006; SAS Institute, Inc., 
2010), R (R Development Core Team, 2011), or MATLAB (MathWorks, Inc., 2011). 

Each of these approaches has advantages and drawbacks. The approach we advocate 
and demonstrate in this chapter is to use simulation methods in Mplus, as proposed by 
Muthén and Muthén (2002). A free demonstration version of Mplus is available at 
www.statmodel.com. Unfortunately, to run the model described in this chapter with the 
demonstration version of Mplus, one needs to drop one predictor variable. We return to the 
justification of the model and the risks of using a simpler version later in the chapter. 

The approach we advocate requires some initial effort to translate the research ques- 
tion into Mplus syntax. This effort could be substantial for those not already familiar 
with Mplus and the logic of simulation. Nevertheless, the payoff will be large because 
this approach is flexible enough to accommodate a wide variety of intensive longitudinal 
designs, and it can also be used for simple multilevel models. 


The Basics 


The problem: Researchers want to draw a conclusion about a phenomenon they hypoth- 
esize to exist in a population, but usually they rely only on data from a sample. Now sta- 
tistical theory tells us how we can make the inferential leap from sample to population, 
but it forces us to accept that there will be considerable uncertainty in our inferences. At 
its core, a power analysis is an assessment of whether the population effect size or signal 
strength the researcher wants to be able to detect is large enough to be detected given the 
uncertainty or noise due to the use of a sample. 

The population effect size or signal strength is often a mean difference or regression 
slope. For example, a researcher might hypothesize that each additional hour of exercise 
by the average subject in a population reduces end-of-day depression by 0.15 units. The 
effect can also be a variance, such as how much people in the population differ from 
one another in their regression slopes. Some readers may find our use of the term effect 
size off-putting given that we use it in cases where the effect is based on an experimen- 
tal manipulation, and also where it is simply a summary of a relationship between two 
nonmanipulated variables. This usage, however, is extremely common, and rather than 
risk confusing readers with an unfamiliar but more accurate term, we opted to follow 
common practice. 

Uncertainty or noise, in statistical terms, is how much the sample estimate of the 
population effect varies from sample to sample. For example, if the previous exercise 
slope predicting depression were -0.15 in one sample and -1.10 in a replication sample, 
then researchers should not trust either as being an accurate reflection of the true popu- 
lation value. When the effect size of interest involves a single degree of freedom (e.g., a 
mean difference or regression slope) rather than multiple degrees of freedom (e.g., differ- 
ences among three conditions), the standard error of the effect under investigation is used 
to assess the amount of statistical noise in the estimate (Rosenthal, Rosnow, & Rubin, 
2000). 

As noted, hypothesis tests can be thought of as assessments of signal-to-noise ratios. 
It is not surprising, then, that many common statistical tests involve computing the ratio 
of the sample effect size (signal strength) to its standard error (noise strength), resulting 
in a t or a z value. To conduct a power analysis, a researcher is fundamentally asking, “If 
I want to have a good chance of detecting the hypothesized effect size, and if I carry out 
the study as planned, how big a standard error is the effect size of interest likely to have?” 
Designs that result in a smaller standard error are better able to detect a given effect size 
and therefore have greater power. 
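
The signal-to-noise logic can be made concrete with a small numerical sketch. It assumes a two-sided z test and treats the standard error as known; the function name `power_from_se` and the standard error of 0.05 used in the note below are hypothetical and are not taken from the chapter's data.

```python
from statistics import NormalDist

def power_from_se(effect, se, alpha=0.05):
    """Approximate power of a two-sided z test for a given effect and standard error."""
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)      # critical value, about 1.96 for alpha = .05
    ratio = abs(effect) / se             # expected signal-to-noise ratio
    # Probability that the test statistic falls beyond either critical value.
    return (1 - z.cdf(crit - ratio)) + z.cdf(-crit - ratio)
```

For instance, `power_from_se(-0.15, 0.05)` evaluates the exercise slope of -0.15 against an assumed standard error of 0.05. Halving that standard error raises the returned power, which is the sense in which designs with smaller standard errors are more powerful.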

Which factors influence power? Almost 80 years ago, Neyman and Pearson (1933) 
developed the appropriate statistical framework to answer this question. Whereas the 
existing approach to hypothesis testing developed by Fisher (1925) was concerned only 
with accepting or rejecting a null hypothesis (usually the null hypothesis of no effect in 
the population), Neyman and Pearson introduced the idea of an alternative hypothesis, 
and through this, the idea of a population effect size. In their framework, it was possible 
to hypothesize a particular population effect size and ask about the probability that 
samples from such a population would detect that effect. 

So, for a power analysis, we work from the assumption that the population effect 
size is a particular nonzero value. Jacob Cohen, beginning in the 1960s, simplified this 
process by suggesting particular nonzero values that could be used for particular research 
questions. For tests of mean differences, for example, he suggested that population dif- 
ferences of 0.2, 0.5, and 0.8 standard deviation units be regarded as “small,” “medium,” 
and “large,” respectively (Cohen, 1969, 1988). Not surprisingly, power depends on the 
size of the effect that one wishes to detect. A particular study could have sufficient power 
to detect, in Cohen’s terminology, a medium or large effect size, but not a small one. Also, 
a study could have sufficient power to test one hypothesis (e.g., a hypothesis regarding 
group differences) but insufficient power to test a more complicated hypothesis (e.g., a 
hypothesis that group differences depend on the age of the participants). 

At this point we provide some statistical notation that will be helpful in building 
a concrete example of a power analysis for intensive longitudinal data. Equation 1 represents 
a population regression model, where $Y_i$, a continuous dependent variable, is 
regressed on one predictor, $X_i$, where $i$ indexes subjects. Assume that $X_i$ can be continuous 
or a binary categorical variable. For the purposes of exposition, assume that $Y_i$ is 
subject $i$’s score on a depression scale, and that $X_i$ is a binary treatment versus control 
variable, where subjects in the control group are coded 0 and those in the treatment group 
are coded 1. In this case, $\beta_0$ is the level of depression of people in the control group, $\beta_0 + \beta_1$ 
is the level of depression in the treatment group, and $\beta_1$ is the difference between the two 
groups, namely, the treatment effect on depression. The error term, $\varepsilon_i$, represents each 
subject $i$’s deviation from the true mean score of their respective groups. 


$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (1)$$

Assume now that we draw a sample from a population where this statistical model 
holds. For this model, the power of any given sample to detect the effect $\beta_1$ of the treatment 
$X_i$ is determined by five main factors. These are summarized in Equation 2: 

$$\text{Pwr} = f(\beta_1, N, \sigma^2_X, \sigma^2_\varepsilon, \alpha) \qquad (2)$$

The first factor is $\beta_1$, the effect size for the treatment. If, for example, the treatment 
lowered depression by 2.5 units, the effect size would be $\beta_1 = -2.5$. Not surprisingly, the 
bigger the absolute value of the effect size, the easier it is for any given sample to detect 
it. Although it is often useful to express effect sizes in standardized units, such as those 
developed by Cohen, unstandardized effect sizes are perfectly legitimate. Note that the 
effect size in Equation 2 is an unstandardized regression coefficient. 

The second factor is the sample size, $N$. Larger sample sizes make it easier to detect 
a population effect. The third factor is $\sigma^2_X$, the extent to which $X$ varies. For example, a 
treatment study with most of the participants in the control condition would have less 
power than a design with the same sample size but equal numbers of participants in each 
condition. The fourth factor is $\sigma^2_\varepsilon$, the amount of unexplained 
variance in $Y$. If, in our example, we were sampling from a population where 
depression scores had high between-subject variability, our study would have less power 
than it would have were we sampling from a population with lower between-subject variability. 
Finally, power is a function of what $\alpha$, or Type I error probability, we choose to 
use. The more stringent the $\alpha$ (e.g., using .01 instead of .05), the lower the power. 
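
The five factors in Equation 2 can be combined in a rough calculation. The sketch below is an approximation under stated assumptions, not a procedure given in the chapter: it uses the large-sample standard error of a simple regression slope, $\sqrt{\sigma^2_\varepsilon / (N \sigma^2_X)}$, and a normal reference distribution in place of the t distribution. The function name and the numerical inputs in the note that follows are hypothetical.

```python
import math
from statistics import NormalDist

def power_simple_regression(beta1, n, var_x, var_eps, alpha=0.05):
    """Approximate power to detect beta1 in the single-level model of Equation 1."""
    # Large-sample standard error of the slope: more subjects or more
    # variance in X shrink it; more unexplained variance inflates it.
    se = math.sqrt(var_eps / (n * var_x))
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)
    ratio = abs(beta1) / se
    return (1 - z.cdf(crit - ratio)) + z.cdf(-crit - ratio)
```

For a binary treatment variable, the variance of $X$ is $p(1 - p)$, where $p$ is the proportion of subjects in the treatment group. Equal allocation gives 0.25, whereas a 90/10 split gives 0.09, so `power_simple_regression(-2.5, 60, 0.25, 25.0)` exceeds `power_simple_regression(-2.5, 60, 0.09, 25.0)` for the same assumed error variance of 25.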

In experimental or intervention studies, the investigators have control over some but 
not all of these factors. Thus, a stronger manipulation in an experiment will increase the 
effect size $\beta_1$ and therefore the probability of concluding that an effect actually exists. 
Similarly, it is common in experimental studies to have some control over the sample 
size, as well as ensuring equal sample sizes across the conditions of X. The unexplained 
variability can be reduced by standard experimental techniques of blocking or matching. 
The Type I error rate is under the investigator’s control in theory, but accumulated norms 
have become so strong that in practice most investigators use .05, and it is rare to see 
investigators use alphas more lenient than .10. 


Power in Intensive Longitudinal Studies 


All things being equal, within-subject designs tend to have more power than between- 
subject designs (Maxwell & Delaney, 2004). In within-subject experimental designs, 
for example, we compare subjects to themselves under different conditions; in between- 
subject designs we compare subjects in one condition to subjects in other conditions. 
Thus, in within-subject experimental designs, stable, systematic differences between sub- 
jects (e.g., neighborhood quality, social desirability, extraversion, or liking of physical 
activity) cannot contribute to the within-subject error variance because these individual 
differences affect all the repeated measures of a subject in the same way. By contrast, 
stable features of the individual and his or her environment do contribute to error vari- 
ance in between-subject experimental designs. 

This benefit does not necessarily apply to nonexperimental within-subject designs, 
of which intensive longitudinal studies are a prominent example. In these nonexperimen- 
tal designs we are often interested in assessing possible causal effects of within-subject 
changes in X, but we fail to realize that between-subject differences in mean X can bias 
estimates of such effects. This problem has been recognized for decades in econometrics 
(e.g., Judge, Griffiths, Hill, & Lee, 1980) but was poorly understood in other branches 
of social science until it was highlighted in recent years by Allison (2005, 2009; see also 
Curran & Bauer, 2011; Hoffman & Stawski, 2009). There are various ways of dealing 
with this problem; the one we use requires (1) that all within-subject X’s are decom- 
posed into their between- and within-subject components, (2) that both components are 
included as predictors in analyses, and (3) that only the coefficients for the within-subject 
components are taken as evidence of within-subject effects. 

As we did earlier for standard single-level power analysis, we begin by specifying a 
population statistical model that we regard as adequate to handle basic analyses of intensive 
longitudinal data. We use a continuous outcome, but in this case it varies over subjects 
and over time. Thus in Equation 3 below, $Y_{it}$ is the score on $Y$ for subject $i$ on measurement 
occasion $t$. For the sample dataset presented later, we analyze end-of-day depression 
scores for $N = 66$ subjects assessed for $T = 9$ consecutive days. The predictor variable $X_{it}$ 
also varies over subjects and time, and in our empirical example this will be the amount of 
physical activity subject $i$ engaged in on day $t$. However, as noted, to avoid potential bias 
in assessing within-subject effects, in the analysis we do not use $X_{it}$ in its original form; 
rather, we split it into two components, $XB_i$, the component that varies between subjects 
only, and $XW_{it}$, the component that varies within subjects only. $XB_i$ is each subject’s mean 
value on $X$, averaging over all $T$ occasions; $XW_{it}$ is how much, on occasion $t$, subject $i$’s 
score is higher or lower than his or her mean level $XB_i$. Given that the original $X_{it}$ is simply 
the sum of $XB_i$ and $XW_{it}$, no information is lost in creating these new variables. 
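
The split into between- and within-subject components amounts to person-mean centering. A minimal sketch, assuming long-format data held as (subject id, X) pairs; the function name `split_between_within` is hypothetical:

```python
from collections import defaultdict

def split_between_within(rows):
    """rows: iterable of (subject_id, x). Returns (subject_id, x, xb, xw) tuples."""
    rows = list(rows)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for sid, x in rows:
        totals[sid] += x
        counts[sid] += 1
    # XB: each subject's mean of X over his or her available occasions.
    means = {sid: totals[sid] / counts[sid] for sid in totals}
    # XW: the occasion-specific deviation from that subject's own mean.
    return [(sid, x, means[sid], x - means[sid]) for sid, x in rows]
```

By construction the two components sum back to the original score, and each subject's within-subject deviations sum to zero.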

Equation 3 represents a population multilevel regression with three predictors. Please 
note: We use a single overall equation in Equation 3 to facilitate comparison with the 
single-level regression model presented in Equation 1. This contrasts with many exposi- 
tions of multilevel statistical models, where at least one separate equation is presented for 
each level of analysis. Also, we use the letter B where it is more common to use y, again 
to facilitate comparison with Equation 1. 


$$Y_{it} = \beta_{0i} + \beta_{1i} XW_{it} + \beta_2 XB_i + \beta_3 T_{it} + \nu_{it} \qquad (3)$$


The coefficients $\beta_{0i}$ and $\beta_{1i}$ are random variables representing subject-specific intercepts 
and $XW$ slopes, respectively. We assume that the coefficients $\beta_{0i}$ and $\beta_{1i}$ are normally 
distributed with a population mean and variance that we represent as $\beta_0$ and $\sigma^2_{\beta_0}$, 
and $\beta_1$ and $\sigma^2_{\beta_1}$, respectively. The third coefficient, $\beta_2$, shows how strongly between-subject 
differences in $X$ relate to $Y$. It is not unusual for it to be very different in size from $\beta_1$, 
the coefficient for within-subject $X$ for the average subject (see, e.g., Bolger & Schilling, 
1991). The fourth coefficient, $\beta_3$, is the time slope, the average change in outcome $Y$ per 
time unit $T$. We assume in this case that the time effect does not vary across participants. 
It is important to include time in the model because time can be an influence on both 
$X$ and $Y$ and can lead to a spurious relationship between the two. We also include an 
error term, $\nu_{it}$, specific to each subject and occasion. As is common in longitudinal and 
time series analysis, we allow adjacent error terms to be correlated. This permits us to 
decompose the error term into an autocorrelation component $\rho\nu_{i,t-1}$ and a pure random 
error component $\varepsilon_{it}$. We assume that $\varepsilon_{it}$ is a normally distributed random variable with a 
mean of 0 and a variance of $\sigma^2_\varepsilon$ that is constant over subjects and occasions. Equation 4 
is as follows: 

$$\nu_{it} = \rho\,\nu_{i,t-1} + \varepsilon_{it} \qquad (4)$$
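
To see how Equations 3 and 4 generate data, the following sketch draws one long-format dataset from the model. It is an illustration under assumptions that are not part of the chapter: the parameter values in `PARAMS` are placeholders, the random intercepts and slopes are drawn independently, the within-subject predictor is drawn with a population mean of zero rather than being centered in each sample, and each subject's first error is drawn from the stationary AR(1) distribution.

```python
import math
import random

# Hypothetical population values, chosen only for illustration.
PARAMS = {"b0": 2.7, "b1": -0.2, "b2": -0.7, "b3": -0.01,
          "sd_b0": 0.4, "sd_b1": 0.45, "sd_xb": 0.3, "sd_xw": 0.4,
          "rho": 0.1, "sd_eps": 0.6}

def simulate_dataset(n_subjects, n_days, p, seed=1):
    """Return rows (id, day, dayc, xb, xw, y) drawn from Equations 3 and 4."""
    rng = random.Random(seed)
    mid = (n_days + 1) / 2                      # center time on the middle day
    # Stationary SD of the AR(1) error, used to start each subject's series.
    sd_nu = p["sd_eps"] / math.sqrt(1 - p["rho"] ** 2)
    rows = []
    for i in range(1, n_subjects + 1):
        b0_i = rng.gauss(p["b0"], p["sd_b0"])   # subject-specific intercept
        b1_i = rng.gauss(p["b1"], p["sd_b1"])   # subject-specific XW slope
        xb = rng.gauss(0.0, p["sd_xb"])         # between-subject part of X
        nu = rng.gauss(0.0, sd_nu)              # error on the first day
        for day in range(1, n_days + 1):
            if day > 1:
                # Equation 4: carry over part of yesterday's error.
                nu = p["rho"] * nu + rng.gauss(0.0, p["sd_eps"])
            xw = rng.gauss(0.0, p["sd_xw"])     # within-subject part of X
            # Equation 3: outcome for subject i on this day.
            y = b0_i + b1_i * xw + p["b2"] * xb + p["b3"] * (day - mid) + nu
            rows.append((i, day, day - mid, xb, xw, y))
    return rows
```

Calling `simulate_dataset(66, 9, PARAMS)` yields 594 rows, matching a complete design with 66 subjects and 9 days.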


Now that we have specified the complete model, we can, as before, identify the 
factors that influence the power of any given sample to detect an effect. In this case, 
the effect of interest is $\beta_1$, the effect of the within-subject predictor $XW$. There are eight 
factors in all, three more than we discussed for the between-subject regression model, as 
Equation 5 shows. 

$$\text{Pwr} = f(\beta_1, N, T, \sigma^2_{XW}, \sigma^2_{\beta_1}, \rho, \sigma^2_\varepsilon, \alpha) \qquad (5)$$


The first is $\beta_1$, the effect size for within-subject $X$ for the average subject in the population. 
The bigger this is, the more power a given sample has to detect it. Next is $N$, the 
number of subjects in the sample; more subjects means more power. The third factor is 
$T$, the total number of measurement occasions per subject. Again, the more occasions 
$T$ over which $XW$ is measured, the more power to detect its effect. The fourth factor is 
the variance in $XW$, $\sigma^2_{XW}$. The more $X$ varies within each subject, the more power the 
study has to detect within-subject effects. The fifth factor is $\sigma^2_{\beta_1}$, how much people differ 
in the within-subject effect of $X$ on $Y$. The more people in the population differ from 
one another in this effect, the harder it is to be confident that the average person has a 
nonzero coefficient. The sixth is $\rho$, the autocorrelation coefficient. The larger the autocorrelation, 
the less new information is gained when an additional measurement occasion is 
added to a study and, therefore, the lower the power of the study to detect effects of $XW$. 
The seventh is the true within-subject random noise, $\sigma^2_\varepsilon$; the greater this is, the lower the 
power. Finally, as before, as the significance level $\alpha$ is made more stringent (i.e., made 
smaller), power is decreased. 
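
A rough way to see how several of these factors combine is the following two-stage approximation to the sampling variance of the average within-subject slope. It is offered only as a heuristic and is not the method used in this chapter: it assumes balanced data, no autocorrelation ($\rho = 0$), and that each subject's slope is first estimated separately and the $N$ estimates are then averaged.

```latex
\operatorname{Var}(\hat{\beta}_1) \;\approx\;
\frac{1}{N}\left(\sigma^2_{\beta_1} + \frac{\sigma^2_\varepsilon}{T\,\sigma^2_{XW}}\right)
```

Under these assumptions, adding measurement occasions shrinks only the second term inside the parentheses, so the variance cannot fall below $\sigma^2_{\beta_1}/N$ however large $T$ becomes; adding subjects shrinks both terms.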


Physical Activity and Depression Example 


The best way to learn about a power analysis for intensive longitudinal data is by doing 
one, and a goal of this chapter is to provide the reader with the necessary example data 
and software commands. In this section we illustrate the process using a simulated dataset 
on daily physical activity and depressive symptoms. The data were simulated to resemble 
an actual dataset of 66 participants who wore accelerometers and completed evening 
diaries for a period of 9 consecutive days. 

The power analysis involves two main steps. Step 1 involves using the sample data- 
set to get descriptive statistics for the predictor variables and to estimate the equivalent 
of Equations 3 and 4 using standard multilevel or mixed model software that allow for 
autocorrelated within-subject errors. Below we provide the syntax for doing so in SAS 
and IBM SPSS (IBM Inc., 2010). We believe that comparable analyses can be accom- 
plished using other software such as R, HLM (Raudenbush, Bryk, Cheong, Congdon, & 
du Toit, 2011), and MLwiN (Rasbash, Charlton, Browne, Healy, & Cameron, 2011). 

Step 2 involves using the parameter estimates obtained in Step 1 together with pre- 
dictor descriptive statistics (means and variances) as population values for Monte Carlo 
simulations using Mplus (as described in Muthén & Muthén, 2002). One uses Mplus to 
(1) specify a population multilevel model with specific parameter values (e.g., for $\beta_1$, $\sigma^2_{XW}$, 
$\sigma^2_{\beta_1}$, from Equation 3), (2) generate many (e.g., 1,000) samples with one specific combination 
of $N$ and $T$ (e.g., $N = 66$, $T = 9$) from such a population model, (3) analyze each 
simulated sample as if it were a real one in order to obtain estimates of those same specific 
population parameters, and (4) report what proportion of the samples resulted in statistically 
significant results (using $p < .05$ as the $\alpha$ level) for any given parameter. 
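
The logic of the four steps can be sketched outside Mplus. The example below is a deliberately simplified stand-in and not the chapter's procedure: instead of maximum likelihood estimation of the full multilevel model, it estimates a least-squares slope for each subject and tests the mean of those slopes against zero, using a normal critical value. It also omits the time trend, the between-subject predictor, and autocorrelation. All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def simulate_slopes(n, t, b1, sd_b1, sd_xw, sd_eps, rng):
    """One simulated sample: a least-squares slope of Y on XW for each subject."""
    slopes = []
    for _ in range(n):
        b1_i = rng.gauss(b1, sd_b1)                    # subject-specific slope
        x = [rng.gauss(0.0, sd_xw) for _ in range(t)]
        mx = sum(x) / t
        xw = [v - mx for v in x]                       # person-mean centered X
        y = [b1_i * v + rng.gauss(0.0, sd_eps) for v in xw]
        # Because xw sums to zero, this ratio is the ordinary least-squares slope.
        slopes.append(sum(a * b for a, b in zip(xw, y)) / sum(a * a for a in xw))
    return slopes

def simulated_power(n, t, b1, sd_b1, sd_xw, sd_eps, reps=1000, alpha=0.05, seed=1):
    """Proportion of simulated samples in which the mean slope is significant."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        s = simulate_slopes(n, t, b1, sd_b1, sd_xw, sd_eps, rng)
        z = mean(s) / (stdev(s) / math.sqrt(n))        # test of mean slope = 0
        hits += abs(z) > crit
    return hits / reps
```

A call such as `simulated_power(66, 9, -0.2, 0.45, 0.4, 0.6)` returns the estimated power for one combination of $N$ and $T$; rerunning it with other combinations shows how the design choices trade off.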

To be more concrete: Imagine that one wishes to determine the power to detect a $\beta_1$ 
effect size of 0.20 units (e.g., the within-subject association between time-varying physical 
activity and depression for the average subject), given that one has specified values for 
the other parameters in the model (e.g., a $\sigma^2_{\beta_1}$ of 0.50) and one has chosen a particular 
$N$ and $T$ study design. If the simulations revealed that estimates of $\beta_1$ were significant in 
60% of the samples, this would mean that this particular study design had a power of .60 
to detect the specified $\beta_1$ effect size. 
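
Because a simulated power value is itself a proportion estimated from a finite number of samples, it carries Monte Carlo error. As a worked illustration using the standard binomial formula (the number of replications, $R = 1{,}000$, follows the example in the text; the calculation is not part of the chapter's procedure):

```latex
\mathrm{SE}(\widehat{\text{Pwr}})
= \sqrt{\frac{\widehat{\text{Pwr}}\,(1 - \widehat{\text{Pwr}})}{R}}
= \sqrt{\frac{0.60 \times 0.40}{1000}} \approx 0.015
```

An estimated power of .60 from 1,000 samples is therefore uncertain by roughly plus or minus .03 (about two standard errors), which is worth keeping in mind when comparing candidate designs whose simulated power differs only slightly.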

This procedure is complex, but in our opinion it is the best that is currently available 
for assessing power in intensive longitudinal studies. If you found the overview paragraph 
above hard to follow, then try it again after reading through the worked example below. 
You can download the full sample dataset (exampledata.sas7bdat for SAS and exampledata.sav 
for SPSS) and syntax files from our website at www.columbia.edu/~nb2229/publications.html. 
As is typical for such datasets, it is in a vertical or long format: Each subject 
has multiple data lines, one for each time point. 

Table 16.1 shows the structure of the simulated dataset, including data lines for 
particular subjects. To facilitate interpretation of the analysis results, all independent 
variables were centered either on the grand mean or the subject mean (see Cohen, Cohen, 
West, & Aiken, 2003, pp. 564-565). Table 16.1, column 4 contains scores on the dependent 
variable, depr. The depression variable was simulated to resemble a mean of several 
depression items, each scaled from 0 to 5, as might be reported at the end of each day 
in an online diary. Variables indexing time are in columns 2 and 3: Day is simply the 
study day coded 1-9. Dayc5 is a version of day centered on the middle day of the study 
(Day 5). 


TABLE 16.1. Data Example: Daily Physical Activity and Evening Depression 

 id  day  dayc5  depr  steps  stepsB  stepsW  stepsB_GM  stepsBc
  1    1     -4  3.09   1.61    0.91    0.71       0.98    -0.08
  1    2     -3  2.63   0.79    0.91   -0.12       0.98    -0.08
  1    3     -2  3.17   1.08    0.91    0.17       0.98    -0.08
  1    4     -1  1.98   1.67    0.91    0.76       0.98    -0.08
  1    5      0  3.00   0.75    0.91   -0.15       0.98    -0.08
  1    6      1  3.32   0.80    0.91   -0.11       0.98    -0.08
  1    7      2  2.06   0.26    0.91   -0.64       0.98    -0.08
  1    8      3  2.02   0.81    0.91   -0.10       0.98    -0.08
  1    9      4  1.31   0.38    0.91   -0.53       0.98    -0.08
  2    1     -4  2.29   0.93    0.62    0.31       0.98    -0.36
  2    2     -3  2.95   0.42    0.62   -0.19       0.98    -0.36
  2    3     -2  2.86   0.68    0.62    0.06       0.98    -0.36
  2    4     -1  3.33   0.50    0.62   -0.12       0.98    -0.36
  2    5      0  2.56   0.39    0.62   -0.22       0.98    -0.36
  2    6      1  3.18   0.56    0.62   -0.06       0.98    -0.36
  2    7      2  2.96   0.80    0.62    0.18       0.98    -0.36
  2    8      3  4.18   0.65    0.62    0.04       0.98    -0.36
  2    9      4  2.26   0.62    0.62    0.00       0.98    -0.36
  ...
 66    1     -4  1.45   1.37    1.07    0.30       0.98     0.09
 66    2     -3  3.08   0.45    1.07   -0.62       0.98     0.09
 66    3     -2  2.64   1.52    1.07    0.45       0.98     0.09
 66    4     -1  3.04   1.67    1.07    0.60       0.98     0.09
 66    5      0  1.75   0.97    1.07   -0.10       0.98     0.09
 66    6      1  2.91   1.01    1.07   -0.06       0.98     0.09
 66    7      2  3.27   0.58    1.07   -0.49       0.98     0.09
 66    8      3  2.67   1.13    1.07    0.06       0.98     0.09
 66    9      4  1.68   0.92    1.07   -0.15       0.98     0.09

Column 5 shows the simulated physical activity variable, steps; this is the number 
of steps taken each day (where 1 unit equals 10,000 steps), as might be obtained from a 
portable accelerometer. Physical activity was simulated to vary both within subject and 
between subject. Therefore, as with the between- and within-subject X variable in Equa- 
tion 3 earlier, steps was split into two predictor variables: (1) stepsB, the between-subjects 
means of steps across all diary days and (2) stepsW, the within-subject deviations from 
these between-subject means. StepsBc, a centered version of stepsB, was created by sub- 
tracting the grand mean (stepsB_GM) from stepsB. 

Because the later simulations require means and variances for all predictor variables, 
these descriptive statistics need to be calculated (e.g., using PROC UNIVARIATE in SAS 
or using DESCRIPTIVES in SPSS). StepsBc, the grand mean-centered subject means 
across all available diary days, has a mean of 0.00 (due to centering) and a variance of 
0.096. StepsW, the within-subject deviations from each subject’s mean, has a mean of 
0.00 and variance of 0.157. 

We simulated the dataset to have missing observations on 15% of days; this level of 
missingness corresponds to what has been observed in a real dataset on physical activ- 
ity and depression. Therefore, instead of a complete dataset with 9 time points for all 
66 participants (i.e., 594 total observations), the current dataset contains a total of 507 
observations, which corresponds to having an average of 7.7 out of 9 observations. 

We now turn to accomplishing Step 1 using PROC MIXED in SAS. To review, Step 
1 involves analyzing the sample dataset to obtain estimates of the parameters that we will 
use together with predictor means and variances in the power analysis in Step 2. Below is 
the exact SAS PROC MIXED syntax: 


*SAS Mixed Model for Physical Activity and Depression Dataset; 
PROC MIXED DATA=power.exampledata METHOD=ml COVTEST; 
CLASS id day ; 
MODEL depr = stepsBc stepsW dayc5 /S DDF = 64, 65, 65; 
RANDOM int stepsW /SUBJECT = id TYPE = un; 
REPEATED day /SUBJECT = id TYPE = ar(1); 
RUN; 


Note that the SAS syntax requires two variables for time: a version for use in the 
MODEL statement and a version for use in the REPEATED statement. For the first, we 
used dayc5, the variable day centered at Day 5; for the second, we used the uncentered 
time variable, called day. The variable dayc5 is used in the model statement to get an 
estimate of the fixed effect of time ($\beta_3$ in Equation 3); this variable could be included in 
the random statement if we wanted to include a random effect for time. The time vari- 
able, day, in the repeated statement is used in obtaining an estimate of p, the first-order 
autocorrelation parameter. This variable also has to appear in the CLASS statement. 
First-order autocorrelation is specified using TYPE = AR(1) in the REPEATED statement. 
An alternative choice for the error covariance structure would be a Toeplitz structure; for 
example, to capture a lag-1 correlation with zero correlation at longer lags, we substitute 
type = toep(2) for type = ar(1). 

Note also that the estimation method we used was full information maximum likeli- 
hood (METHOD=ML) rather than the more standard restricted maximum likelihood 
(METHOD=REML). We did so in order to be consistent with the simulation software to 
be used in Step 2; that software, Mplus, does not allow REML estimation. For intensive 
longitudinal datasets, the difference in estimator will usually matter very little. Finally, 
rather than use the PROC MIXED defaults, we used the DDF option on the MODEL 
statement to specify degrees of freedom for tests of fixed effects. For the between-subject 
predictor stepsBc, we used N minus 2 degrees of freedom, that is, 66 - 2 = 64, because 
there were two between-subject predictors, the intercept and stepsBc. For the within-subject 
predictors stepsW and dayc5, we used N minus 1 degrees of freedom, that is, 66 - 1 
= 65, because there were no cross-level interactions. For more details on relevant options 
in SAS PROC MIXED, see Bolger and Laurenceau (in press). 

The simulated example data used in this chapter do not show much remaining autocorrelation 
($\hat{\rho} = -0.048$), but many intensive longitudinal datasets do show appreciable 
autocorrelation, and we believe it is important to include this parameter in the model. 
Failure to adjust for autocorrelation results in overly small estimates of within-subject 
error variance, and causes an upward bias in test statistics, such as t values and F values 
(Fitzmaurice et al., 2004). 

Running this model in SAS gives the following output. 


Dimensions

Covariance Parameters         5
Columns in X                  4
Columns in Z Per Subject      2
Subjects                     66
Max Obs Per Subject           9


Number of Observations

Number of Observations Read        507
Number of Observations Used        507
Number of Observations Not Used      0


Solution for Fixed Effects

Effect       Estimate   Standard Error   DF   t Value   Pr > |t|
Intercept      2.6624           0.0575   64     46.30     <.0001
stepsBc       -0.6748           0.1899   64     -3.55     0.0007
stepsW        -0.1857           0.0978   65     -1.90     0.0621
dayc5         -0.0138           0.0110   65     -1.25     0.2166


Covariance Parameter Estimates

Cov Parm   Subject   Estimate   Standard Error   Z Value     Pr Z
UN(1,1)    id          0.1688           0.0386      4.38   <.0001
UN(2,1)    id         -0.0152           0.0464     -0.33   0.7437
UN(2,2)    id          0.2154           0.0947      2.27   0.0115
AR(1)      id         -0.0475           0.0605     -0.78   0.4330
Residual               0.3946           0.0285     13.85   <.0001


The following SPSS syntax gives the same results: 


*SPSS Mixed Model for Physical Activity and Depression Dataset.
MIXED depr WITH stepsBc stepsW dayc5
  /FIXED=stepsBc stepsW dayc5 | SSTYPE(3)
  /METHOD=ML
  /PRINT=SOLUTION
  /RANDOM=INTERCEPT stepsW | SUBJECT(id) COVTYPE(UN)
  /REPEATED=day | SUBJECT(id) COVTYPE(AR1).


Table 16.2 organizes these mixed model results in a form recommended for journal
articles (see American Psychological Association, 2009, pp. 147-148; Bolger &
Laurenceau, in press). Given that diary and intensive longitudinal studies are usually
carried out to assess within-subject associations, the single most important parameter
estimate is $\beta_1$, the within-subject relationship between daily physical activity and
depression for the average subject: On days when the typical subject's physical activity was one
unit above his or her typical level, daily depression was lower by 0.19 units. An effect of
this size is small but not trivial: It is equivalent to a Cohen's d of 0.28 within-person SD
units (see notes at the bottom of Table 16.2). The sampling error for the unstandardized
estimate is sufficiently large, however, that using a 95% confidence interval we cannot
rule out zero as a possible population value (95% CI = -0.38, 0.01).
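
The two summary numbers in this paragraph can be checked by hand. The short Python sketch below does the arithmetic using the estimate and standard error for stepsW from the output above and the within-subject variance reported in the note to Table 16.2. The critical value of t for 65 degrees of freedom (1.9971) is an assumption on our part: it is entered as a looked-up constant rather than computed.

```python
import math

estimate = -0.1857   # fixed effect of stepsW from the mixed model output
std_error = 0.0978   # its standard error
t_crit = 1.9971      # assumed two-sided 95% critical value of t with 65 df (looked up)

# 95% confidence interval: estimate plus or minus t_crit standard errors.
ci_low = estimate - t_crit * std_error
ci_high = estimate + t_crit * std_error

# Effect size in within-person SD units: |estimate| / within-subject SD.
within_var = 0.45    # within-subject variance of evening depression (Table 16.2 note)
within_sd = math.sqrt(within_var)
effect_size = abs(estimate) / within_sd

print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"within-person SD = {within_sd:.2f}")
print(f"effect in within-person SD units = {effect_size:.2f}")
```

The interval includes zero by a narrow margin, which is the sense in which the pilot result is inconclusive.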

We simulated these data to produce inconclusive results because this is a common
situation in the early stages of a research program; for example, when researchers are in 
the process of developing a grant proposal they often conduct a small and underpowered 
pilot study. It is in this situation that one needs guidance on how much to increase the 
number of persons and time points to ensure sufficiently powered future studies. Fortu- 
nately, the ability to conduct such power simulations for diary and intensive longitudinal 
studies has recently become available. 

Step 2 begins with the researcher conducting a power simulation using the N and 
T for the initial dataset, the predictor means and variances, and the parameter esti- 
mates from the multilevel model in Step 1. Note that although the simulated pilot study 
(and most real diary studies) had missing data, we ran the power simulations under the 
assumption of no missing data. Making such an assumption did not alter the power 
estimates appreciably and, as will be seen, it simplifies the task of creating “what-if” sce- 
narios where the number of persons and time points are systematically varied. 

The model used in the power simulations differed from the model estimated on the
pilot data in two other ways: it did not include an autocorrelation parameter and it did
not include a time trend. The first omission is because Mplus cannot specify
autocorrelation when data are analyzed in the long form (as they almost invariably are for diary
and intensive longitudinal data). The second omission is because the free demonstration
version of Mplus allows no more than two independent variables. Given the choice of
dropping stepsW, stepsBc, or dayc5, we dropped dayc5, the nonsignificant time trend
variable.


TABLE 16.2. Parameter Estimates for Multilevel Model of Daily Physical
Activity Predicting Evening Depression

Parameter                                Symbol                        Estimate    SE
Fixed effects
  Intercept          Intercept           $\beta_0$                      2.66 **    0.058
  Activity           stepsW              $\beta_1$                     -0.19 †     0.098
                     stepsBc             $\beta_2$                     -0.68 **    0.190
  Time               dayc5               $\beta_3$                     -0.01       0.011
Random effects
  Within-subject     Error               $\sigma^2_{\varepsilon}$       0.39 **    0.029
                     Autocorrelation     $\rho$                        -0.05       0.061
  Between-subject    Intercept           $\sigma^2_{\beta_0}$           0.17 **    0.039
                     stepsW              $\sigma^2_{\beta_1}$           0.22 *     0.095
                     Covariance          $\sigma_{\beta_0,\beta_1}$    -0.02       0.046

Note. The number of subjects was 66; the number of days averaged 7.7 out of a possible 9; and the total
number of observations was 507 out of a possible 594 (85%). The variance decomposition for evening
depression was 0.65, 0.20, and 0.45, for total, between-subject, and within-subject variances, respectively.
The corresponding standard deviations were 0.81, 0.45, and 0.67. The estimate for $\beta_1$ in within-person SD
units is 0.19/0.67 = 0.28. This corresponds to a small effect size in Cohen's (1988) terminology (see Bolger
& Laurenceau, in press, for discussion of effect sizes in diary and intensive longitudinal designs).

†p < .10; *p < .05; **p < .01.

The reader may wonder about the value of a power simulation that omits parameters 
and variables that would be included in typical analysis models for diary and intensive 
longitudinal data. The reason it is possible to omit these in the simulations is that the 
omitted variables and parameters are controls only. Their inclusion in the PROC MIXED 
analyses was to ensure that the other, central parameters such as the fixed and random 
effects of stepsW would be free of bias. By conducting simulation using these presum- 
ably debiased parameter estimates, we are in effect obtaining the results that would be 
obtained had we included these controls. We urge readers who possess the full version of 
Mplus to run simulations with the fully parametrized model to convince themselves that 
this is the case (more on this point later). 

The initial power simulation below is one that represents a baseline against which 
subsequent simulations will be compared. Specifically, it uses the parameter estimates 
from the pilot data as population values and also uses the number of persons and time 
points that would have been observed had there been no missing data. With these initial 
settings, the results of the simulation should tell us what we already know, that the pilot 
study was underpowered for estimating the within-subject effect of physical activity on 
depression. Having conducted this baseline simulation, we then proceed to specifying
various what-if scenarios of the kind that might be used in developing a grant application.

The Mplus syntax is given next. Syntax lines with specific comments are explained
later in the text (e.g., the line with the comment !(a) is explained under the heading (a)
in the later text).


!Exclamation point indicates a comment
MONTECARLO: NAMES ARE depr stepsW stepsBc;   !(a)
            NOBSERVATIONS = 594;             !(b)
            NCSIZES = 1;
            CSIZES = 66 (9);                 !(c)
            SEED = 5859;                     !(d)
            NREPS = 1000;                    !(e)
            WITHIN = stepsW;                 !(f)
            BETWEEN = stepsBc;               !(g)
ANALYSIS:   TYPE = TWOLEVEL RANDOM;          !(h)

MODEL POPULATION:
            %WITHIN%                         !(i)
            slope | depr ON stepsW;
            [ stepsW*0.00 ];
            depr*0.395; stepsW*0.157;
            %BETWEEN%                        !(j)
            depr ON stepsBc*-0.675;
            depr WITH slope*-0.015;
            [ depr*2.662 ]; [ slope*-0.186 ]; [ stepsBc*0 ];
            depr*0.169; slope*0.215; stepsBc*0.096;
MODEL:                                       !(k)
            %WITHIN%
            slope | depr ON stepsW;
            [ stepsW*0.00 ];
            depr*0.395; stepsW*0.157;
            %BETWEEN%
            depr ON stepsBc*-0.675;
            depr WITH slope*-0.015;
            [ depr*2.662 ]; [ slope*-0.186 ]; [ stepsBc*0 ];
            depr*0.169; slope*0.215; stepsBc*0.096;


The syntax contains the following pieces of information: 


a. a list of the variables in the model (in our example, depr, stepsW, stepsBc); 

b. the total number of observations calculated as the product of the number of sub- 
jects N and the number of time points T (i.e., 594 = 66 subjects * 9 time points); 

c. the number of subjects N (66) and time points T (9); 

d. an arbitrary numerical seed (5859) that enables one to rerun the simulation at a 
later time and get the same results; 

e. the number of simulations (1000); 

f. predictors that vary within subjects only (stepsW); 

g. predictors that vary between subjects only (stepsBc);

h. the type of model to be run, a two-level random effects model (TYPE =
TWOLEVEL RANDOM);

i. specifications for the within-subject part of the model, including
   • a random slope for the within-subject predictor (slope | depr ON stepsW) and the
     within-subject residual variance (depr*0.395) obtained from the SAS/SPSS
     multilevel model;
   • the variance (stepsW*0.157) and the mean of the within-subject predictor
     ([ stepsW*0.00 ]) obtained from the descriptive statistics of stepsW;

j. specifications for the between-subject part of the model, including
   • the fixed effect of the between-subject predictor (depr ON stepsBc*-0.675), the
     covariance of the intercept and the within-subject predictor's slope (depr WITH
     slope*-0.015), and the means of the intercept and the within-subject predictor's
     slope ([ depr*2.662 ], [ slope*-0.186 ]), all obtained from the SAS/SPSS
     multilevel model;
   • the mean of the between-subject predictor ([ stepsBc*0 ]) obtained from the
     descriptive statistics of stepsBc;
   • the variance of the intercept and the variance of the within-subject predictor's
     slope obtained from the SAS/SPSS multilevel model (depr*0.169; slope*0.215);
   • the variance of the between-subject predictor (stepsBc*0.096) obtained from
     the descriptive statistics of stepsBc;

k. a specification of the analysis model that is (in our case) identical to the
simulation model.


The following are the results obtained using the demonstration version of Mplus 
version 6.1. 


MODEL RESULTS

                                  ESTIMATES              S. E.     M. S. E.    95%     % Sig
                   Population   Average   Std. Dev.     Average                Cover    Coeff

Within Level

 Means
  STEPSW              0.000      0.0007     0.0166       0.0161     0.0003     0.938    0.062

 Variances
  STEPSW              0.157      0.1567     0.0091       0.0090     0.0001     0.936    1.000

 Residual Variances
  DEPR                0.395      0.3939     0.0259       0.0253     0.0007     0.934    1.000

Between Level

 DEPR ON
  STEPSBC            -0.675     -0.6728     0.1937       0.1800     0.0375     0.919    0.933

 DEPR WITH
  SLOPE              -0.015     -0.0151     0.0436       0.0409     0.0019     0.941    0.074

 Means
  STEPSBC             0.000      0.0017     0.0380       0.0378     0.0014     0.948    0.052
  SLOPE              -0.186     -0.1795     0.0892       0.0907     0.0080     0.950    0.502

 Intercepts
  DEPR                2.662      2.6642     0.0577       0.0565     0.0033     0.942    1.000

 Variances
  STEPSBC             0.096      0.0950     0.0164       0.0161     0.0003     0.918    1.000
  SLOPE               0.215      0.2117     0.0915       0.0907     0.0084     0.911    0.689

 Residual Variances
  DEPR                0.169      0.1627     0.0374       0.0352     0.0014     0.897    1.000


The partial output above summarizes the results of the 1,000 simulated samples.
The last column of the output (with the header % Sig Coeff) gives power estimates for
each parameter, that is, the proportion of samples in which the parameter estimate was
statistically significant at $\alpha$ = .05. The key power estimate is the one for the fixed effect
of stepsW, which is the within-subject relationship between daily physical activity and
depression for the average subject ($\beta_1$ in Equation 5). The estimate is 0.502, which means
that in only 50% of the 1,000 simulated samples did the slope for the within-subject
predictor reach statistical significance. This, of course, is what one would expect to find
given that the simulation is based on the original underpowered pilot study.
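
For readers without access to Mplus, the logic of a Monte Carlo power estimate can be sketched in a few lines of Python. The sketch below is not the Mplus procedure: instead of fitting a multilevel model to each simulated sample, it swaps in a simpler two-stage approximation (an ordinary least squares slope per subject, followed by a one-sample z test on the mean slope). The function name simulate_power, the default of 200 replications, and the normal critical value are our own assumptions; the population values are those used in the syntax above. Because the estimator differs, the resulting power values should be expected to be close to, but not identical with, the Mplus results.

```python
import math
import random
import statistics


def simulate_power(n_subjects, n_times, n_reps=200, seed=5859,
                   mean_slope=-0.186, var_slope=0.215,
                   var_x=0.157, var_resid=0.395):
    """Proportion of simulated samples in which the mean within-subject slope is significant."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # normal approximation, alpha = .05
    sd_slope = math.sqrt(var_slope)
    sd_x = math.sqrt(var_x)
    sd_resid = math.sqrt(var_resid)

    n_significant = 0
    for _rep in range(n_reps):
        slopes = []
        for _subject in range(n_subjects):
            # Each subject has his or her own true slope (random effect).
            true_slope = rng.gauss(mean_slope, sd_slope)
            x = [rng.gauss(0.0, sd_x) for _ in range(n_times)]
            # The subject's intercept is omitted: it cancels out of the
            # within-subject slope once x and y are centered within subject.
            y = [true_slope * xi + rng.gauss(0.0, sd_resid) for xi in x]
            x_bar = statistics.fmean(x)
            y_bar = statistics.fmean(y)
            sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            sxx = sum((xi - x_bar) ** 2 for xi in x)
            slopes.append(sxy / sxx)
        std_error = statistics.stdev(slopes) / math.sqrt(n_subjects)
        if abs(statistics.fmean(slopes) / std_error) > z_crit:
            n_significant += 1
    return n_significant / n_reps


baseline = simulate_power(66, 9)          # pilot design: 66 subjects, 9 days
more_subjects = simulate_power(110, 9)    # what-if: 110 subjects, 9 days
print(f"estimated power, N = 66,  T = 9: {baseline:.2f}")
print(f"estimated power, N = 110, T = 9: {more_subjects:.2f}")
```

Fixing the seed makes a run repeatable, which serves the same purpose as the SEED option in the Mplus syntax.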


Power Curves for the Within-Subject Effect 
for Varying Numbers of Subjects and Time Points 


We have now reached the point where we can conduct power simulations that would
be useful in grant writing and in planning follow-up studies. We focus on the following
pragmatic question that is common in diary and intensive longitudinal studies: Other things
being equal, in order to detect a within-subject effect, is it better to increase the number 
of time points per person or the number of persons per time point? With the earlier Mplus 
syntax in hand, it is relatively easy to answer this question. To do so, we rerun the syntax, 
keeping all the simulation parameters the same except for number of persons and time 
points. 

Figure 16.1 displays the results of 10 Mplus power simulations, involving combina- 
tions of increased participants and time points. To understand the figure, recall that the 
baseline power simulation used a combination of 66 subjects and 9 time points, resulting 
in a total of 594 observations. Now consider a total number of observations of 660. As 
shown below the horizontal axis in Figure 16.1, this can be accomplished in two ways: 
by increasing the number of subjects from 66 to 73 (approximately) while keeping the 
time points at 9, or by increasing the number of time points from 9 to 10 while keeping 
the subjects at 66. The net result is approximately the same increase in power, from 50% 
to 59%. 
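
The bookkeeping behind these what-if scenarios is simple enough to script. The Python sketch below lists, for each target number of observations, the design obtained by adding subjects (time points fixed at 9) and the design obtained by adding time points (subjects fixed at 66), together with the two MONTECARLO lines that would change in the Mplus syntax. The helper names (add_subjects, add_time_points, montecarlo_lines) are hypothetical, and rounding to the nearest whole subject is our assumption about how the approximate values were obtained.

```python
BASE_SUBJECTS = 66
BASE_TIMES = 9


def add_subjects(total_obs):
    """Design that reaches roughly total_obs by adding subjects, keeping 9 time points."""
    return round(total_obs / BASE_TIMES), BASE_TIMES


def add_time_points(total_obs):
    """Design that reaches roughly total_obs by adding time points, keeping 66 subjects."""
    return BASE_SUBJECTS, round(total_obs / BASE_SUBJECTS)


def montecarlo_lines(n_subjects, n_times):
    """The MONTECARLO lines that differ between scenarios."""
    return [
        f"NOBSERVATIONS = {n_subjects * n_times};",
        f"CSIZES = {n_subjects} ({n_times});",
    ]


targets = (660, 990, 1320, 1650, 1980)
for total in targets:
    for label, (n, t) in (("adding subjects", add_subjects(total)),
                          ("adding time points", add_time_points(total))):
        print(f"target {total:4d}, {label:18s}: N = {n:3d}, T = {t:2d} -> "
              + " ".join(montecarlo_lines(n, t)))
```

Because subjects come in whole numbers, the designs that add subjects only approximate the target total (for example, 73 subjects with 9 time points gives 657 observations rather than 660).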

As we move, however, from 660 observations to 990, the power increases begin to 
diverge. To reach 990 observations, we can keep the time points at 9 and increase the 
subjects from 73 to 110, or we can keep the subjects at 66 and increase time points from 
10 to 15. Figure 16.1 shows that power is improved more by increasing subjects than time 
points, and the differential increases as we examine further equivalent combinations of 
persons and time points. A power of 80%, the magic number for grant applications, can
be achieved by increasing the number of subjects to just under 120 while keeping the time
points to 9; by contrast, even were we to increase the number of time points to 30, while
keeping subjects at 66, we would still only reach a power of 78%.

[Figure 16.1: line plot of power (vertical axis, ticks at 0.4 and 0.8) against total
observations N*T = 660, 990, 1320, 1650, 1980 (horizontal axis). One curve adds subjects
with 9 time points (N = 73, 110, 147, 183, 220); the other adds time points with 66
subjects (T = 10, 15, 20, 25, 30).]

FIGURE 16.1. Power curves for the fixed effect of physical activity: What is the benefit of adding
subjects versus time points to the sample?

Figure 16.1 shows nicely the phenomenon that has been highlighted in the past (e.g., 
Snijders & Bosker, 1999), that increasing upper-level units can often result in more power 
than increasing the number of lower-level units. The tradeoff, however, can be different 
when the cost of increasing subjects is substantially greater than the cost of increasing 
time points. In diary studies, for example, it can often be more expensive to add subjects 
than time points. 

The simulation approach described in this chapter has some limitations. First, it does 
not take account of missing data and the accompanying loss of power. One solution is to 
draw on previous research to make assumptions about how many participants and time 
points will be lost, and adjust the sample size used in the Mplus simulation. Another is 
to use Mplus’s ability to simulate datasets with particular patterns of missing data (see 
Muthén & Muthén, 1998-2010, chap. 12). Yet another solution is to use Zhang and 
Wang’s (2009) SAS power analysis macro that allows one to specify missing data. 

The second limitation is that in order to use the demonstration version of Mplus we 
illustrated the power simulation with a scaled-back model of the data generating process. 
Although we believe that this approach does not bias the power estimates, it is likely 
that they are more imprecise than estimates derived from simulating the full model that 
provided us with the parameter values in the first place. On our chapter website,
www.columbia.edu/~nb2229/publications.html, we provide the Mplus syntax for a power
simulation based on the full model.


Summary


We hope that this chapter illustrates how to conduct a simulation-based power analysis 
for within-subject effects in models of diary and intensive longitudinal data. To conduct 
a power analysis based on Monte Carlo simulation, one needs a lot of information about 
the model to be tested. In most cases researchers will need to have already conducted at 
least some pilot work in order to specify hypothesized parameter values. It may be pos- 
sible in some cases, however, to make informed guesses about the parameters based on 
similar published studies. For a more general and in-depth treatment of power simula- 
tion using Mplus, see Muthén and Muthén (2002). For a more extended introduction to 
power analysis for diary and intensive longitudinal studies, with a variety of datasets and 
statistical models, see Bolger and Laurenceau (2013).


CHAPTER 17


Psychometrics 


PATRICK E. SHROUT 
SEAN P. LANE 


A basic provision of good research design is that measures should have good
psychometric properties; that is, the measures should reflect the constructs of
scientific interest and have minimal measurement error. Psychometric texts (e.g., Crocker &
Algina, 1986; Embretson & Reise, 2000) instruct researchers on how to achieve these 
goals for single time point individual-difference measures, but rarely is attention paid to 
the psychometrics of intensive repeated measures, such as those obtained in diary studies. 
In this chapter we aim to provide guidelines and techniques for this special measurement 
context. 

Our focus is on the classic psychometric concepts of reliability and validity. We say 
that a measure is reliable when repeated applications of the measurement procedure pro- 
duce the same numerical value. We say that a measure has validity if there is evidence that 
scores from the measurement procedure display empirical patterns that are consistent 
with the theoretical construct of interest. Reliability is necessary but not sufficient for 
validity (e.g., Shrout & Lane, in press). It is possible to obtain a highly reliable measure 
that is not at all related to the construct for which it is named. 

We begin the chapter with a brief review of some of the special characteristics of 
diary studies that affect measurement quality. We then consider issues of reliability for 
diary data, noting that different definitions of reliability need to be used for between- 
person comparisons and within-person questions. Following our practical description of 
how to estimate reliability in diary studies, we consider issues of documenting the valid- 
ity of diary measures. Throughout the chapter we illustrate the points with a numerical 
example based on an actual diary study. 


Overview of Psychometric Issues in Diary Studies 


Some researchers are drawn to diary studies because diary designs seem to make the
measurement process more transparent and less prone to error than traditional
retrospective methods. For example, a researcher interested in how many days in a week a patient
experiences depression might ask for daily reports of levels of positive affect, negative
affect, and anhedonia. The patient can reflect upon and report the extent and severity of
these symptoms in the current day. Contrast this straightforward approach to the widely
used retrospective measure of depression symptoms called the Center for Epidemiologic
Studies Depression Scale (CESD; Radloff, 1977), which asks respondents to describe how
often in the past week 20 different symptoms occurred, ranging from 0 (Rarely or none 
of the time: Less than 1 day) to 3 (Most or all of the time: 5-7 days). Error in the CESD 
can arise if a participant fails to recall the number of days that he or she “felt sad” or 
includes days outside of the designated 7-day period. Respondents might also choose to 
report high scores, such as a 3, to incorporate severity of the sadness, even if it lasted for 
only 4 days. In other words, the retrospective self-report of moods such as sadness can 
be influenced by a number of cognitive and affective error processes that are eliminated 
with daily reports. 

We develop this illustration further in an empirical example for the chapter. We 
report data from 209 persons who disclosed the extent to which they were experiencing 
moods such as “sad” or “blue” on a 5-point scale. Clearly the participant’s cognitive task 
is much simpler for the diary item than for the CESD item, and recall errors should be 
effectively eliminated. If 7 days of diary reports are combined to produce a summary, 
as the CESD attempts to do, one would expect that it would have more validity than a 
reconstruction of the week from memory. 

The tradeoff in the diary study is that it would not usually be feasible to ask respon- 
dents to report on 20 different symptoms of depression, 20 on anxiety, 20 on self-esteem, 
and so on, in a protocol that asks for reports every day for 44 days. Most diary research- 
ers try to shorten the diary protocol to minimize participant burden. Although we under- 
stand this goal, we urge the inclusion of at least three daily items for each construct 
being studied. Redundancy of item content is needed to document the reliability of the 
reports. In some cases (e.g., when respondents are making judgments in a hurry, or in a 
distracting everyday context) it is necessary to average the replicate measures to obtain 
reliable daily scores. The formal psychometric analyses we describe later can be informa- 
tive about how many replicate items are needed for a given construct. 

Replicate items in a diary protocol are useful for psychometric analyses only if they 
are presented in a way that avoids memory effects and response biases. If a respondent is 
asked to rate three related moods on a small smartphone screen, he or she will recognize 
that the answers should be consistent. The high level of concordance would not provide 
an accurate picture of the care the participant gives in making the diary responses. A bet- 
ter design would be to separate replicate items in the protocol, so that each is answered 
independently. When possible, the response options should be reversed, so that partici- 
pants are forced to use the full range of choices. 

In addition to the quantitative psychometric analyses that we discuss next, good 
psychometric practice requires qualitative analyses and careful pilot testing of any mea- 
sures that are to be used in a diary study. Due to length concerns it is often not possible to 
import literal questions from between-person measures of attitude, relationships, affect, 
or health problems into a study that asks persons to give reports many times a week. 
Prior to primary data collection, a qualitative phase of psychometric analysis might ask a 
sample of participants to answer questions in an open-ended format, followed by a focus 
group discussion of how the repeated questions were experienced. For example, suppose 
that someone is interested in activation of various aspects of identity, such as ethnic,
gender, or age identity. How easy is it for respondents to focus on the importance of
being female or Latino at a given diary collection time? What way of phrasing the ques- 
tion makes it easy for participants to reflect on the time-varying construct of interest? 
How can the same question be asked more than once to check consistency without being 
annoying? These are questions that require careful pilot work to answer. 


Estimating Reliability of Measures from Diary Studies 


In conventional experimental and survey studies, the measures are designed to assess dif- 
ferences between people. Reliability in these contexts is measured by obtaining empirical 
replications of the measurement procedure (Crocker & Algina, 1986; Shrout & Lane, in 
press). The two most common reliability designs are test-retest reliability designs and 
internal consistency designs. Test-retest designs literally ask that the measures be taken 
twice (or more), over an interval that maintains the stability of the underlying construct 
and minimizes memory or measurement biases. For example, participants might be asked 
to complete the CESD on 2 consecutive days, with the assumption that the recalled days 
are very similar but recall for past responses to specific items is not a problem. The internal 
consistency design, which usually leads to the computation of Cronbach’s alpha (Cron- 
bach, 1951), considers different items within a scale to be replicate measures. This popular 
design allows reliability to be estimated from a single administration of the scale. 

The test-retest psychometric model is the parallel test form model from classical test
theory (Crocker & Algina, 1986, p. 124). Let $X_{jk}$ represent the measurement of person j
at time k. For example, $X_{j1}$ and $X_{j2}$ might be two administrations of the CESD on
consecutive days. The parallel measure model is

$$X_{jk} = P_j + e_{jk} \tag{1}$$

where $P_j$ is the time-invariant person-level quantity, and $e_{jk}$ is the error of measurement at
time k for person j. The parallel measure model assumes that the two measures (test and
retest) have the same amount of measurement error (i.e., that $\mathrm{Var}(e_{jk})$ is constant across
measurement occasions) and that the expected score is the same for both measures (i.e.,
$E(X_{j1}) = E(X_{j2}) = P_j$, where $E(X)$ represents the statistical expected value of the measure
over many possible replicate measures). It also assumes that the error terms are
independent across replicate measurements.

The reliability coefficient of classical test theory is given as the proportion of the
variance of a measure that can be attributed to individual differences in the true score,
$P_j$. Let $\sigma_X^2$ represent the variance of X across people, and let $\sigma_P^2$ represent the variance of
the true score, P. Finally, let $\sigma_e^2$ represent the variance of the random measurement error
(assumed constant across persons and replications). Then the variance of the measure (scale
for test-retest, items for internal consistency) can be decomposed as
$\sigma_X^2 = \sigma_P^2 + \sigma_e^2$, and the reliability coefficient is expressed as

$$R_X = \frac{\sigma_P^2}{\sigma_X^2} = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_e^2} \tag{2}$$

When reliability is assessed with a test-retest design, the simple Pearson correlation
between the two measures provides a good estimate of $R_X$ (Crocker & Algina, 1986,
p. 133). Note that this correlation is not affected by systematic changes in the mean of X
across the two time points, but only by the relative ordering of persons with respect to
their scores on X at the two times.
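
The point that a shift in the mean leaves the test-retest correlation untouched can be verified numerically. In the Python sketch below, the scores are hypothetical values made up for illustration, and pearson_r is our own small helper rather than a function from any statistical package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scale scores for six persons on two consecutive days.
test = [12, 18, 25, 9, 30, 21]
retest = [14, 17, 27, 11, 28, 19]

# Same retest scores, but everyone scores 5 points higher on the second day.
retest_shifted = [score + 5 for score in retest]

r_original = pearson_r(test, retest)
r_shifted = pearson_r(test, retest_shifted)
print(f"test-retest correlation:              {r_original:.3f}")
print(f"after adding 5 to every retest score: {r_shifted:.3f}")
```

Both correlations are the same because adding a constant changes the retest mean but not the deviations from that mean, which are all that enter the formula.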

The psychometric model for the internal consistency design is somewhat different
from that shown in Equation 1. Instead of saying that the overall scale is represented by
$X_{jk}$, we let $M_{ij}$ represent the ith item measurement in the scale. We assume that there are
m items in the scale. The parallel measure model in this case becomes

$$M_{ij} = P_j + e_{ij} \tag{3}$$

As before, the parallel measure model assumes that the m items are all measuring the
same value of $P_j$ (called the "true score"), and that the errors are independent and have
the same variance across different items.

In principle we could estimate the average reliability for a single item; we call that
reliability estimate $R_M$. However, we are usually interested in the reliability of the scale,
which can be calculated as the average of the m items:

$$X_j = \frac{1}{m} \sum_{i=1}^{m} M_{ij}$$

The reliability of the scale is almost always larger than the reliability of the items, because
the impact of the error variation is reduced by averaging. The Spearman-Brown formula
(Crocker & Algina, 1986, pp. 118-119) can be used to express the reliability of the scale
($R_X$) as a function of the reliability of the typical item ($R_M$). This is shown in the left
portion of Equation 4, which also shows on the right how the size of the error variation is
reduced by a factor of 1/m.

$$R_X = \frac{m R_M}{1 + (m - 1) R_M} = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_e^2 / m} \tag{4}$$
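
The equivalence of the two sides of Equation 4 can be confirmed with a short calculation. The Python sketch below uses hypothetical variance values (a person variance of 2 and an error variance of 1) and function names of our own choosing; it is meant only to show that the Spearman-Brown form and the variance form give the same answer.

```python
def spearman_brown(item_reliability, n_items):
    """Left side of Equation 4: scale reliability from the reliability of a typical item."""
    return n_items * item_reliability / (1 + (n_items - 1) * item_reliability)


def scale_reliability_from_variances(var_person, var_error, n_items):
    """Right side of Equation 4: error variance is divided by the number of items."""
    return var_person / (var_person + var_error / n_items)


# Hypothetical variance components, chosen only for illustration.
var_person = 2.0
var_error = 1.0
n_items = 3

item_reliability = var_person / (var_person + var_error)   # Equation 2 applied to one item
left_side = spearman_brown(item_reliability, n_items)
right_side = scale_reliability_from_variances(var_person, var_error, n_items)

print(f"item reliability:         {item_reliability:.3f}")
print(f"scale reliability (left):  {left_side:.3f}")
print(f"scale reliability (right): {right_side:.3f}")
```

With these values a single item has reliability 2/3, whereas the three-item average has reliability 6/7, which illustrates the gain from averaging replicate items.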


When the Spearman-Brown formula is applied to internal consistency data, the
estimate is called Cronbach's alpha (Cronbach, 1951). Alpha is calculated by a number of
statistical software systems such as Statistical Package for the Social Sciences (SPSS 18;
SPSS, Inc., 2010), Statistical Analysis Software (SAS 9.2; SAS Institute, Inc., 2008), and
R (Revelle, 2010). To illustrate how alpha is calculated in practice, we constructed a
simple example in which five persons supposedly completed three mood items on each of
4 days (Table 17.1). Figure 17.1 shows computer syntax for calculating Cronbach's alpha
($R_X$) at each of the four time points. The syntax for SPSS 18 also produces estimates of
the item-level reliability ($R_M$), in addition to the scale-level estimate. Both of the software
programs require that the responses to different items be listed in separate columns, with
each row representing a different person (at each time). For the four time points in Table
17.1 the results are $R_M(1)$ = .69, $R_X(1)$ = .87; $R_M(2)$ = .70, $R_X(2)$ = .88; $R_M(3)$ = .80, $R_X(3)$
= .92; $R_M(4)$ = .80, $R_X(4)$ = .92. These values describe the reliability of an individual item
($R_M$) or the entire scale ($R_X$) in distinguishing the individuals at each of the four time
points.
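
As a cross-check that does not depend on any particular statistics package, the scale-level values can be reproduced with a few lines of Python. The sketch below enters the Table 17.1 responses directly and applies the usual variance form of Cronbach's alpha; the function name cronbach_alpha and the data layout are our own choices, not the syntax shown in Figure 17.1.

```python
import statistics

# Table 17.1 responses: (person, time, item 1, item 2, item 3).
ROWS = [
    (1, 1, 2, 3, 6), (2, 1, 3, 4, 4), (3, 1, 6, 6, 5), (4, 1, 3, 4, 3), (5, 1, 7, 8, 7),
    (1, 2, 3, 3, 4), (2, 2, 5, 7, 7), (3, 2, 6, 7, 8), (4, 2, 3, 5, 9), (5, 2, 8, 8, 9),
    (1, 3, 4, 2, 5), (2, 3, 4, 6, 7), (3, 3, 7, 8, 9), (4, 3, 5, 6, 7), (5, 3, 6, 7, 8),
    (1, 4, 1, 3, 4), (2, 4, 5, 9, 7), (3, 4, 8, 9, 9), (4, 4, 8, 7, 9), (5, 4, 6, 8, 6),
]


def cronbach_alpha(item_rows):
    """Cronbach's alpha for rows of item responses (one row per person)."""
    n_items = len(item_rows[0])
    # Sample variance of each item across persons.
    item_variances = [statistics.variance(column) for column in zip(*item_rows)]
    # Sample variance of the summed scale score across persons.
    total_variance = statistics.variance([sum(row) for row in item_rows])
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


alpha_by_time = {}
for time in (1, 2, 3, 4):
    responses = [row[2:] for row in ROWS if row[1] == time]
    alpha_by_time[time] = cronbach_alpha(responses)
    print(f"time {time}: alpha = {alpha_by_time[time]:.2f}")
```

Run on these data, the sketch gives alpha values of .87, .88, .92, and .92 for the four time points, matching the scale-level estimates reported above.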

TABLE 17.1. Simple Numerical Example of Five Respondents
Answering Three Items at Four Time Points

Person   Time   Item 1   Item 2   Item 3
  1       1       2        3        6
  2       1       3        4        4
  3       1       6        6        5
  4       1       3        4        3
  5       1       7        8        7
  1       2       3        3        4
  2       2       5        7        7
  3       2       6        7        8
  4       2       3        5        9
  5       2       8        8        9
  1       3       4        2        5
  2       3       4        6        7
  3       3       7        8        9
  4       3       5        6        7
  5       3       6        7        8
  1       4       1        3        4
  2       4       5        9        7
  3       4       8        9        9
  4       4       8        7        9
  5       4       6        8        6

Note. Responses are constructed, not real data. Items and times are assumed to be
the same for each person.


As the example in Table 17.1 also illustrates, diary designs with multiple items
measuring each scale provide combinations of repeated measures and internal consistency
designs. Cranford and colleagues (2006) described how information from replications
across items and across time can be combined to estimate reliability using a
generalizability theory (GT) analysis (Brennan, 2001; Cronbach, Gleser, Nanda, & Rajaratnam,
1972). In addition to decomposing the variation of diary items into a between-person
component and measurement error, they describe how variation may be affected by
within- and between-person processes. The psychometric model reported by Cranford
and colleagues has nine components:


$$M_{ijk} = \mu + I_i + P_j + T_k + (IP)_{ij} + (IT)_{ik} + (PT)_{jk} + (IPT)_{ijk} + e_{ijk} \tag{5}$$

$M_{ijk}$ is the measure for the ith item for the jth person at the kth time point. The first term,
$\mu$, is simply the population mean of the measure.¹ $I_i$ represents the tendency for the ith
item to obtain relatively higher or lower average endorsements. $P_j$ corresponds to the
between-person effect described in Equations 1 and 3; it describes the tendency for
person j to have higher or lower scores on measures, regardless of the item or the time point.
$T_k$ represents the tendency for the kth time point to have higher or lower values across
persons and items. It might reflect, for example, tendencies for weekend days to be
associated with lower anxiety or higher intimacy than weekdays. $(IP)_{ij}$ is an interaction that
represents any tendency for person j to respond higher or lower to item i. For example,
if a diary item has a slightly different meaning to persons trained in British English
relative to those raised with American English, then this term would represent the systematic
effect. $(IT)_{ik}$ is the tendency for item i to receive systematically different scores at time k
than on other days, across all persons. An example discussed by Cranford and colleagues
(2006) is that "discouraged" might be interpreted as global negative affect on nonstress
days but a task-specific evaluation on days with high personal challenge. More often than
not, this effect would be of methodological rather than substantive interest. In contrast,
$(PT)_{jk}$ is an interaction of special interest to diary researchers. It represents the tendency
for person j to have a systematically different score at time k than other persons. By
systematic, we mean that the pattern is apparent in all the items, suggesting genuine change
for person j at time k. The final two terms represent residual error processes. $(IPT)_{ijk}$ is
the tendency of person j to respond to item i in a unique way at time k relative to the other
times, and $e_{ijk}$ represents any random response error that occurs when person j answers
item i at time k.

Equation 5 assumes that it is meaningful to distinguish times and items, as well 
as people, in the diary study. This assumption makes most sense when diary times are 
referenced to an anchor event, such as an exam or a vacation period. When times are 
arbitrary, such as in event-contingent diary designs, some modification of the Cranford 
approach is needed. We discuss that issue later. 

When times can at least be meaningfully ordered, a GT analysis of Equation 5 
decomposes the overall variance of $M_{ijk}$ into the variance components corresponding to
item, person, time, and the two-way interactions. Because there are not replicate mea- 
surements of each item at each time, the three-way interaction is typically combined 
with the error term. To carry out the decomposition, a number of assumptions are made. 
First, the components in Equation 2 that represent the true score and error variation 
are assumed to be independent. Second, the person, item, time, and error variances are 
assumed to apply to all persons and items, and to time. Third, the relation of responses 
to each item to the true score, P, is assumed to be constant—all items are considered to 
be simple replications, and none is considered to be more relevant to the concept than 
the others. The independence assumption implies that it is not necessary to consider the 
order of the repeated measures (i.e., that adjacent measures are no more correlated than 
measures distant in time). In a later section we consider the implications of the fact that 
these assumptions are likely to be violated in actual daily diary data. 

Statistical software such as SPSS and SAS provides procedures to obtain estimates of 
the variance components of Equation 5, and these are illustrated in the syntax example 
shown in Figure 17.1. Both software systems require that data such as that shown in 
Table 17.1 be reformatted, so that each item response is on a separate line of data (a for- 
mat that is called stacked data). The responses are identified by variables that indicate 
which person, time, and item response is recorded on each line. Both SPSS and SAS will 
reformat the data to be stacked, and the syntax is included in Figure 17.1. Once the data 
are reorganized, there are two different programs that can provide the variance compo- 
nent estimates. The first program in both SPSS and SAS is called VARCOMP, and each 
requires that a model reflecting Equation 5 be entered. In both, we suggest that the vari- 
ance estimates be obtained using an analysis of variance (ANOVA) approach. As Searle 
(1995) describes, this approach works well when there are no missing data. In this case 
we assume that each person answers the same three items at each of the four times. The 
second program that can be used is the MIXED approach, which fits a multilevel model 
to the item response data (see Nezlek, Chapter 20, this volume). This approach focuses 
on the variance components that involve the person: the between-person variance and 
the person-by-item and person-by-time interaction variances. The estimation method is 
restricted maximum likelihood (REML). For the example shown in Table 17.1 the differ- 
ent approaches give exactly the same numerical results (see Table 17.2). 
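
As an illustration of the ANOVA approach (this is not the SPSS or SAS syntax of Figure 17.1), the following Python sketch estimates the variance components of Equation 5 for the Table 17.1 responses. It assumes a balanced, fully crossed design with no missing data and treats person, time, and item as random; the function and variable names are our own, and negative estimates are not truncated. With these data it returns approximately 2.343 (person), 0.919 (person by time), 0.121 (person by item), and 0.958 (residual).

```python
from itertools import product
from statistics import mean

# Table 17.1 responses, DATA[time][person] = (item 1, item 2, item 3).
DATA = [
    [(2, 3, 6), (3, 4, 4), (6, 6, 5), (3, 4, 3), (7, 8, 7)],  # time 1
    [(3, 3, 4), (5, 7, 7), (6, 7, 8), (3, 5, 9), (8, 8, 9)],  # time 2
    [(4, 2, 5), (4, 6, 7), (7, 8, 9), (5, 6, 7), (6, 7, 8)],  # time 3
    [(1, 3, 4), (5, 9, 7), (8, 9, 9), (8, 7, 9), (6, 8, 6)],  # time 4
]
N_T, N_P, N_I = 4, 5, 3
P, T, I = range(N_P), range(N_T), range(N_I)


def y(p, t, i):
    """Response of person p to item i at time t (all zero-based)."""
    return DATA[t][p][i]


def anova_variance_components():
    """Expected-mean-square estimates for a balanced person x time x item
    design with one response per cell (illustrative sketch only)."""
    grand = mean(y(p, t, i) for p, t, i in product(P, T, I))
    m_p = [mean(y(p, t, i) for t, i in product(T, I)) for p in P]
    m_t = [mean(y(p, t, i) for p, i in product(P, I)) for t in T]
    m_i = [mean(y(p, t, i) for p, t in product(P, T)) for i in I]
    m_pt = {(p, t): mean(y(p, t, i) for i in I) for p, t in product(P, T)}
    m_pi = {(p, i): mean(y(p, t, i) for t in T) for p, i in product(P, I)}
    m_ti = {(t, i): mean(y(p, t, i) for p in P) for t, i in product(T, I)}

    # Sums of squares for main effects and two-way interactions.
    ss_p = N_T * N_I * sum((m - grand) ** 2 for m in m_p)
    ss_t = N_P * N_I * sum((m - grand) ** 2 for m in m_t)
    ss_i = N_P * N_T * sum((m - grand) ** 2 for m in m_i)
    ss_pt = N_I * sum((m_pt[p, t] - m_p[p] - m_t[t] + grand) ** 2
                      for p, t in product(P, T))
    ss_pi = N_T * sum((m_pi[p, i] - m_p[p] - m_i[i] + grand) ** 2
                      for p, i in product(P, I))
    ss_ti = N_P * sum((m_ti[t, i] - m_t[t] - m_i[i] + grand) ** 2
                      for t, i in product(T, I))
    ss_total = sum((y(p, t, i) - grand) ** 2 for p, t, i in product(P, T, I))
    # The three-way interaction cannot be separated from error.
    ss_e = ss_total - ss_p - ss_t - ss_i - ss_pt - ss_pi - ss_ti

    # Mean squares.
    ms_p = ss_p / (N_P - 1)
    ms_t = ss_t / (N_T - 1)
    ms_i = ss_i / (N_I - 1)
    ms_pt = ss_pt / ((N_P - 1) * (N_T - 1))
    ms_pi = ss_pi / ((N_P - 1) * (N_I - 1))
    ms_ti = ss_ti / ((N_T - 1) * (N_I - 1))
    ms_e = ss_e / ((N_P - 1) * (N_T - 1) * (N_I - 1))

    # Solve the expected-mean-square equations for each component.
    return {
        "person": (ms_p - ms_pt - ms_pi + ms_e) / (N_T * N_I),
        "time": (ms_t - ms_pt - ms_ti + ms_e) / (N_P * N_I),
        "item": (ms_i - ms_pi - ms_ti + ms_e) / (N_P * N_T),
        "person*time": (ms_pt - ms_e) / N_I,
        "person*item": (ms_pi - ms_e) / N_T,
        "time*item": (ms_ti - ms_e) / N_P,
        "error": ms_e,
    }


components = anova_variance_components()
for name, value in components.items():
    print(f"{name:12s} {value:.3f}")
```

Because the design is balanced, the ANOVA estimates have a closed form; with missing responses a likelihood-based method such as REML would be the more natural choice.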

Once the estimates of the variance components are obtained, the reliability estimates 
can be constructed as an extension of the internal consistency approach to the general 




TABLE 17.2. Variance Components for VARCOMP (ANOVA) and MIXED Methods
for the Simple Numerical Example


Variance component            Variance notation                  VARCOMP   MIXED
Variability across persons    $\sigma^2_{\text{PERSON}}$         2.343     2.343
Variability across time       $\sigma^2_{\text{TIME}}$           0.381     —
Variability across items      $\sigma^2_{\text{ITEM}}$           0.607     —
Person-by-time variability    $\sigma^2_{\text{PERSON*TIME}}$    0.919     0.919
Person-by-item variability    $\sigma^2_{\text{PERSON*ITEM}}$    0.121     0.121
Time-by-item variability      $\sigma^2_{\text{TIME*ITEM}}$      0.047     —
Residual variability          $\sigma^2_{\text{ERROR}}$          0.958     0.958
Total                                                            5.376     4.341


reliability coefficient in Equation 1. We define the true score variation for the numera- 
tor of Equation 1 in various ways, but estimation of the error variation is based on the 
tendency for subjects to be inconsistent in item responses at different times. Note that 
unlike the classic test-retest approach, Equation 5 does not assume that measurements 
are stable. In fact, the most interesting feature of Equation 5 is its claim that systematic 
person differences at time k can be represented by the term (PT). For example, suppose 
a person is rarely elevated on a Profile of Mood States (POMS) scale with items “angry,” 
“resentful,” and “annoyed,” and that his or her responses to these items are typically 1, 
with an occasional 2. Suppose, on a certain weekday, that he or she responds to all three 
items with values of 4 or 5. This variation is more than we expect from background noise, 
and the variation would be represented by (PT). Had that weekday been one of two mea- 
surement occasions in a simple test-retest reliability assessment, the interesting change in 
overall scale scores would have been absorbed in the error term in Equation 1. 

We illustrate how the variance decomposition can be informative with the real-life 
daily diary example we mentioned earlier. Lane and Shrout (2011) described data on 
daily mood ratings of 221 recent law school graduates preparing for the state bar exami- 
nation (a professional licensing exam). In the current example, we focus on 209 intimate 
partners of the examinees, who also completed daily diary reports of mood while their 
mates were preparing for the exam. Of the 209 partners, 52% were female and average 
age was 30.1 years (SD = 7.7). Like the examinees, the partners completed an abbreviated,
16-item version of McNair, Lorr, and Droppleman’s (1992) POMS that assessed five
mood variables: Anxiety (anxious, on edge, uneasy); Depression (sad, hopeless, blue, 
discouraged); Anger (angry, resentful, annoyed); Fatigue (fatigued, worn out, exhausted); 
and Vigor (vigorous, cheerful, lively). Two times each day (once in the morning and once 
in the evening) for 44 consecutive days participants rated the extent to which they were 
feeling the presented moods “RIGHT NOW” on a 5-point scale, ranging from Not at all 
(0) to Extremely (4). This example is similar in structure to that reported by Cranford et 
al. (2006). 

Table 17.3 shows the variance decomposition of the item ratings for the five moods.
Between 16% and 35% of the variance of the item responses was categorized as error
($\sigma^2_{\text{ERROR}}$), which means that some people at certain times gave items within a scale unusual
responses, after taking into account person, item, and time trends. Similarly, between
TABLE 17.3. Variance Decomposition and Between- and Within-Person Reliability
Estimates of the POMS-16 Items for Partners in the Bar Exam Study (Crossed Design)


Variance component                 Anxious mood     Depressed mood   Anger            Fatigue          Vigor
                                   Est.     %       Est.     %       Est.     %       Est.     %       Est.     %
$\sigma^2_{\text{PERSON}}$         0.176    28.0    0.060    18.4    0.054    14.5    0.351    30.1    0.290    21.6
$\sigma^2_{\text{TIME}}$           0.008     1.3    0.001     0.3    0.001     0.2    0.036     3.1    0.012     0.9
$\sigma^2_{\text{ITEM}}$           0.009     1.4    0.008     2.4    0.006     1.7    0.006     0.5    0.281    21.0
$\sigma^2_{\text{PERSON*TIME}}$    0.195    31.0    0.119    36.6    0.159    42.9    0.549    47.1    0.341    25.5
$\sigma^2_{\text{PERSON*ITEM}}$    0.052     8.3    0.024     7.4    0.022     6.1    0.041     3.6    0.121     9.0
$\sigma^2_{\text{TIME*ITEM}}$      0.001     0.2    0.000     0.1    0.000     0.1    0.001     0.1    0.009     0.7
$\sigma^2_{\text{ERROR}}$          0.188    29.8    0.113    34.9    0.128    34.5    0.181    15.5    0.286    21.4
Total                              0.630   100.0    0.325   100.0    0.370   100.0    1.165   100.0    1.341   100.0

$R_{KF}$                           1.00             1.00             0.99             1.00             1.00
$R_{1R}$                           0.49             0.35             0.28             0.38             0.48
$R_{KR}$                           0.98             0.98             0.96             0.98             0.98
$R_C$                              0.76             0.81             0.79             0.90             0.78


Note. $R_{KF}$ represents the estimate of between-person reliability in the daily diary context, calculated across persons and
times. $R_C$ represents the estimate of within-person reliability.


15% and 30% of variance was associated with the person component ($\sigma^2_{\text{PERSON}}$). This
means that some persons rated their mood consistently high or low across times and
items. This is the variance of interest in standard studies of between-person differences.
The smallest components were associated with the between-item, between-time,
person-by-item, and time-by-item sources of variance. Of particular interest is that the
person-by-time component of variance ($\sigma^2_{\text{PERSON*TIME}}$) is consistently the largest source
of variance (ranging from 26 to 47%). Averaging across items within a scale, it appears
that there is consistent evidence of meaningful change across persons.

Cranford and colleagues (2006) described how these variance component estimates 
can be used to estimate a number of generalizability coefficients, which are variations 
of the classic reliability coefficient in Equation 2. We focus on four versions that are of 
fundamental interest. The first is the reliability of the average of POMS ratings from all 
items and all days for a given scale. This would be the best measure of whether someone 
tends to score high or low on daily mood (e.g., anxiety or sadness), and its reliability will 
benefit from combining m item reports across k time points. In this first version, we con- 
sider the sequence of time points to be fixed, as well as the items. This makes most sense 
when each person completes the diary in the same period relative to one or more anchor 
events, such as starting a new job or a new school year, or approaching an exam, as in 
our empirical example. When everyone has the same items and times, we consider these 
components to be fixed rather than random, and this simplifies the reliability estimate. 
Cranford and colleagues provided the following expression for between-person reliability 
in this case: 


$$R_{KF} = \frac{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{PERSON*ITEM}}/m}{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{PERSON*ITEM}}/m + \sigma^2_{\text{ERROR}}/(km)} \tag{6}$$

The subscripts for $R_{KF}$ indicate that the reliability is an average over k time points (K),
and that the times are considered to be fixed (F). As one can see from the calculations
provided in Table 17.3, the reliability of the between-person diary averages ($R_{KF}$) is excellent
for all five mood scales. In fact, for four of the five cases, the value is larger than .995
and is therefore rounded to 1.00. Particularly influential is the effect that the number of
repeated measurements, k, has on the error term. The 88 measurements combined with
the assessment of multiple items reduced the magnitude of the error term to less than 1%
of its original size.
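
To make the role of k and m concrete, the following Python sketch evaluates Equation 6, read here as (person + person*item/m) divided by (person + person*item/m + error/(km)), using the rounded variance components reported in Table 17.3. The function name `r_kf` and the dictionary layout are our own, and the sketch assumes k = 88 occasions (two reports on each of 44 days) and m equal to the number of items in each scale.

```python
def r_kf(var_person, var_person_item, var_error, m, k):
    """Between-person reliability of the average over k fixed occasions
    and m fixed items (Equation 6)."""
    true_part = var_person + var_person_item / m
    return true_part / (true_part + var_error / (k * m))


# (person, person*item, error, number of items m), as reported in Table 17.3.
TABLE_17_3 = {
    "Anxious": (0.176, 0.052, 0.188, 3),
    "Depressed": (0.060, 0.024, 0.113, 4),
    "Anger": (0.054, 0.022, 0.128, 3),
    "Fatigue": (0.351, 0.041, 0.181, 3),
    "Vigor": (0.290, 0.121, 0.286, 3),
}
K = 88  # two reports per day for 44 days

r_kf_by_scale = {
    name: r_kf(person, person_item, error, m, K)
    for name, (person, person_item, error, m) in TABLE_17_3.items()
}
for name, value in r_kf_by_scale.items():
    print(f"{name:10s} R_KF = {value:.3f}")

# With a single occasion the error term is no longer divided by k,
# so the same components give a noticeably lower coefficient.
single_occasion = r_kf(0.176, 0.052, 0.188, m=3, k=1)
```

Because the inputs are rounded to three decimals, the results should be expected to agree with the published coefficients only to about two decimals.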

Cranford and colleagues (2006) also considered the possibility that a measurement 
from a randomly chosen day might be used in a study. In this case, the level of the mea- 
sure would be affected by the randomly selected day, as well as the unique way the person 
responded to that random time point. They called the next reliability coefficient $R_{1R}$,
where 1R refers to one random time point. In this case, the reliability is calculated using 
the following: 


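$$R_{1R} = \frac{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{PERSON*ITEM}}/m}{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{PERSON*ITEM}}/m + \sigma^2_{\text{TIME}} + \sigma^2_{\text{PERSON*TIME}} + \sigma^2_{\text{ERROR}}/m} \tag{7}$$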


Although Cranford and colleagues (2006) did not consider the next variation explicitly,
one extension of $R_{1R}$ that is frequently of interest is when subjects have different
diary periods that are virtually random but the number of time points (k) is the same for
each period. If the investigator planned to average across the k random time points, the
variance components for TIME and PERSON*TIME would be divided by k, yielding a
coefficient we could call $R_{KR}$.


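$$R_{KR} = \frac{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{PERSON*ITEM}}/m}{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{PERSON*ITEM}}/m + \sigma^2_{\text{TIME}}/k + \sigma^2_{\text{PERSON*TIME}}/k + \sigma^2_{\text{ERROR}}/(km)} \tag{8}$$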


Comparing the estimates for $R_{KR}$ to those for $R_{KF}$ and $R_{1R}$ in Table 17.3, we see that
they always fall in between the other two. This is because the additional variance components
related to days in the denominator are divided by the number of days (k).
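
The effect of dividing the day-related components by k can be checked with a short Python sketch of Equation 8, read here as the Equation 6 ratio with TIME/k and PERSON*TIME/k added to the denominator. The function name `r_kr` is our own, the inputs are the rounded Table 17.3 components for two of the scales, and k = 88 is assumed as before.

```python
def r_kr(var_person, var_person_item, var_time, var_person_time,
         var_error, m, k):
    """Reliability of the average over k random occasions and m fixed
    items (Equation 8)."""
    true_part = var_person + var_person_item / m
    noise = (var_time + var_person_time) / k + var_error / (k * m)
    return true_part / (true_part + noise)


# person, person*item, time, person*time, error, m (Table 17.3).
COMPONENTS = {
    "Anxious": (0.176, 0.052, 0.008, 0.195, 0.188, 3),
    "Anger": (0.054, 0.022, 0.001, 0.159, 0.128, 3),
}
K = 88

r_kr_by_scale = {
    name: r_kr(p, pi, t, pt, e, m, K)
    for name, (p, pi, t, pt, e, m) in COMPONENTS.items()
}
for name, value in r_kr_by_scale.items():
    print(f"{name:8s} R_KR = {value:.2f}")
```

Although the person-by-time component is the largest single source of variance, averaging over 88 occasions shrinks its contribution to the denominator enough that the coefficient stays close to the fixed-occasion value.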

Of special interest to many researchers using diary methods is the reliability of day- 
to-day change in mood, closeness, or whatever is the target of research. Cranford and col- 
leagues (2006) provided the following estimator of reliability of change in the case when 
all participants answer the same (fixed) set of m items over a fixed time period: 


$$R_C = \frac{\sigma^2_{\text{PERSON*TIME}}}{\sigma^2_{\text{PERSON*TIME}} + \sigma^2_{\text{ERROR}}/m} \tag{9}$$

$R_C$ takes the same form as Equation 2, in which true score variability (here, the
person-by-time variability) is divided by the sum of the person-by-time variability and the
error variability. As mentioned earlier, the primary interest is the proportion of variability
due to changes in ratings over time across individuals. Equation 9 assumes that items are
fixed, which means that all individuals answer the same items at each time point. These
items are averaged, so the error variation is divided by the number of items. In the last row
of Table 17.3 we show the result of evaluating Equation 9 with the variance component
estimates of the five mood ratings.² In general the within-person reliability estimates were
moderate to good. Fatigue was estimated to have excellent reliability, while the other four
moods hovered around .80, which is considered to be moderate reliability (Shrout, 1998).

Note that the magnitudes of these kinds of reliability estimates are quite different 
in this example, and will often be different in practice. Just because the reliability of 
a between-person assessment might be very high does not mean that the reliability of 
change will be high. Critical to the analysis of change is the size of the person-by-time
variance component relative to the size of the underlying measurement error. As can
be seen from Equation 9, the reliability of change can usually be improved by increasing
the number of items. As m gets larger, the value of $R_C$ will get larger so long as there is
evidence of person-by-time variation. For example, if a new study included a fourth mood
item for anxious mood, the $R_C$ value would be expected to go from .76 to .81.³
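
The projection from .76 to .81 can be reproduced with a minimal Python sketch of Equation 9, read here as person*time divided by (person*time + error/m). The function name `r_c` is our own, and the calculation assumes that any added item behaves like the existing ones, that is, that the person-by-time and error components reported for anxious mood in Table 17.3 stay the same.

```python
def r_c(var_person_time, var_error, m):
    """Reliability of within-person change for an m-item average
    (Equation 9)."""
    return var_person_time / (var_person_time + var_error / m)


# Anxious mood components from Table 17.3.
VAR_PERSON_TIME = 0.195
VAR_ERROR = 0.188

r_c_by_items = {m: r_c(VAR_PERSON_TIME, VAR_ERROR, m) for m in (1, 2, 3, 4, 6)}
for m, value in r_c_by_items.items():
    print(f"m = {m}: R_C = {value:.2f}")
```

The gain from each additional item shrinks as m grows, so in practice the choice of m is a trade-off between precision of change scores and respondent burden.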

When we introduced the GT psychometric model of Cranford and colleagues (2006) 
in Equation 5, we noted that the model assumes that times and items can be distinguished. 
That assumption is usually justified for items, which are often fixed across time in the 
diary study. However, time points are often arbitrary, and might be best considered to 
be nested within person. In an event-contingent diary design (see Moskowitz & Sadikaj, 
Chapter 9, this volume) two persons might have k entries that are completely unique 
in structure. One person might have k social interactions over different days, whereas 
another might have all k interactions within a single day. The timing and the nature
of the times make even the ordering variable of little value. Nezlek and Gable (2001)
described a psychometric model for nested days and items, and the variance components
can be estimated using either multilevel models or variance component procedures
described earlier. This approach provides three nested variance components: $\sigma^2_{\text{PERSON}}$,
$\sigma^2_{\text{TIME(PERSON)}}$, and $\sigma^2_{\text{ERROR(TIME(PERSON))}}$. All three of the components treat time and items
as random effects. The reliability coefficients described by Nezlek and Gable are defined
within the GT tradition to describe the reliability of between-person differences (averaged
over items and time) and the reliability of within-person time variation (averaged over
items). We will call the first $R_{KRN}$ and the second $R_{CN}$ to distinguish these versions from
versions discussed earlier.


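$$R_{KRN} = \frac{\sigma^2_{\text{PERSON}}}{\sigma^2_{\text{PERSON}} + \sigma^2_{\text{TIME(PERSON)}}/k + \sigma^2_{\text{ERROR}}/(km)} \tag{10}$$


$$R_{CN} = \frac{\sigma^2_{\text{TIME(PERSON)}}}{\sigma^2_{\text{TIME(PERSON)}} + \sigma^2_{\text{ERROR}}/m} \tag{11}$$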


Table 17.4 illustrates the calculation of estimates when the data are considered to 
be fully nested. The same data used in Table 17.3 were used in this illustration, and one
can see that the reliability estimates for $R_{KR}$ and $R_{KRN}$ are virtually identical, but the estimates
for $R_C$ and $R_{CN}$ sometimes vary. When sources of variation due to item and time
can be isolated by design, then $R_C$ will usually be larger than $R_{CN}$.
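
The nested coefficients can be sketched in the same way. The Python functions below implement Equations 10 and 11, read here as person / (person + time(person)/k + error/(km)) and time(person) / (time(person) + error/m). The names `r_krn` and `r_cn` are our own, the inputs are the rounded Table 17.4 components for two of the scales, and k = 88 and m = 3 are assumed.

```python
def r_krn(var_person, var_time_in_person, var_error, m, k):
    """Between-person reliability when times are nested within persons
    and items within times (Equation 10)."""
    noise = var_time_in_person / k + var_error / (k * m)
    return var_person / (var_person + noise)


def r_cn(var_time_in_person, var_error, m):
    """Nested analogue of the reliability of within-person change
    (Equation 11)."""
    return var_time_in_person / (var_time_in_person + var_error / m)


# person, time(person), error, m (Table 17.4).
NESTED = {
    "Anxious": (0.194, 0.183, 0.250, 3),
    "Fatigue": (0.365, 0.569, 0.229, 3),
}
K = 88

nested_estimates = {
    name: (r_krn(p, tp, e, m, K), r_cn(tp, e, m))
    for name, (p, tp, e, m) in NESTED.items()
}
for name, (between, within) in nested_estimates.items():
    print(f"{name:8s} R_KRN = {between:.2f}  R_CN = {within:.2f}")
```

In the nested analysis the item-related sources of variation are absorbed into the error component, which is one way to see why the nested within-person coefficient tends to be the smaller of the two.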

Although the GT approach (whether crossed or nested) provides valuable tools for 
assessing the reliability of change in various conditions, it relies on assumptions that 
might not be well founded in the data. Lane and Shrout (2011) developed an alternative 
method of estimating reliability of change that does not assume items are parallel (i.e., 
have the same degree of association with the true score), or that error variance is the same 
TABLE 17.4. Variance Decomposition and Between- and Within-Person Reliability
Estimates of the POMS-16 Items for Partners in the Bar Exam Study (Nested Design)


Variance component                   Anxious mood     Depressed mood   Anger            Fatigue          Vigor
                                     Est.     %       Est.     %       Est.     %       Est.     %       Est.     %
$\sigma^2_{\text{PERSON}}$           0.194    30.7    0.066    20.2    0.061    16.5    0.365    31.3    0.330    24.6
$\sigma^2_{\text{TIME(PERSON)}}$     0.183    29.1    0.112    34.4    0.150    40.5    0.569    48.8    0.219    16.3
$\sigma^2_{\text{ERROR}}$            0.250    39.7    0.145    44.8    0.157    42.4    0.229    19.7    0.697    52.0
Total                                0.627    99.5    0.323    99.4    0.368    99.4    1.163    99.8    1.247    93.0

$R_{KRN}$                            0.98             0.97             0.96             0.98             0.98
$R_{CN}$                             0.69             0.75             0.74             0.88             0.48


Note. $R_{KRN}$ represents the estimate of between-person reliability, assuming that items are nested within times, which are
nested within persons. $R_{CN}$ represents the estimate of the nested analogue of within-person reliability.


across items or different participants. Their method used a factor-analytic (FA) perspec- 
tive and a generalized omega reliability statistic (McDonald, 1999). Omega takes on a 
form similar to the classic reliability ratio of true score variance to total variance (see 
Equation 2) but allows individual items (1) to be related differentially to the underlying 
construct being measured, and (2) to have different error distributions. In its simplest 
form, the omega measurement model qualifies Equation 1 by multiplying the person 
effect, P, by a factor loading, $\lambda_i$, when fitting $X_{ij}$, the ith item response for person j. In
addition, the variance of the residual term is not assumed to be the same for all items but
is subscripted by i. Consistent with the FA literature, McDonald (1999) called that error
variance $\psi_i$. When Equation 1 is extended in this way, the reliability estimate is called
omega and it is written as 


$$\omega = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \psi_i} \tag{12}$$


Lane and Shrout (2011) show how Equation 12 can be extended to diary data. Essen- 
tially the method involves fitting an FA model to each individual’s time series of data and 
calculating omegas for each individual. Special FA methods for within-person data can 
be used (Browne & Nesselroade, 2005; Browne & Zhang, 2005) when the researcher is 
additionally interested in the autoregressive structure of the temporal process, but similar 
results can be obtained with more standard FA programs, as we illustrate here. 

We carried out factor analyses of the 88 mood reports using available data for 
between 184 and 203 of the original 209 persons, depending on the mood subscales. We 
were not able to carry out analyses for the excluded persons because of limited variability 
over time. Such cases lead to a zero in the denominator of Equation 12 and, as a result, an
undefined omega. 
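
A minimal Python sketch of the omega calculation is given below. It reads Equation 12 as the squared sum of the loadings divided by that quantity plus the sum of the error variances, which assumes a single factor with its variance fixed at 1. The function name `omega` is our own, and the inputs are the median Anxiety loadings and error variances from Table 17.5, used purely for illustration: omega computed from median loadings need not equal the median of the per-person omegas reported in that table.

```python
def omega(loadings, error_variances):
    """Omega for a one-factor model with unit factor variance
    (Equation 12). Returns None when the denominator is zero, as
    happens for a person with no variability over time."""
    common = sum(loadings) ** 2
    total = common + sum(error_variances)
    if total == 0:
        return None
    return common / total


# Median unstandardized loadings and error variances for the Anxiety
# items (on edge, uneasy, anxious) from Table 17.5.
anxiety_loadings = [0.42, 0.39, 0.29]
anxiety_errors = [0.11, 0.06, 0.18]

anxiety_omega = omega(anxiety_loadings, anxiety_errors)
print(f"omega from median loadings: {anxiety_omega:.3f}")

# A person whose item scores never change yields an undefined omega.
flat_person_omega = omega([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
```

Unlike the GT coefficients, this ratio lets an item with a large loading and small error variance contribute more to the estimated reliability than a weaker item.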

Table 17.5 shows summaries of the factor loadings across the individual partners. 
The values are median factor loadings and error variances obtained from within-person 
factor analyses done separately for each person for each mood subscale. We pay most 
attention to the relative size of the loadings within each factor, not their absolute mag- 
nitude, because these are unstandardized loadings rather than usual, standardized load- 
TABLE 17.5. Summaries of Individual Within-Person Factor Analyses for the Bar Exam
Partners


Scale                    Item           $\lambda$   Err. var.
Anxiety (N = 195)        On edge        0.42        0.11
                         Uneasy         0.39        0.06
                         Anxious        0.29        0.18
Depression (N = 191)     Discouraged    0.28        0.09
                         Blue           0.30        0.04
                         Sad            0.30        0.06
                         Hopeless       0.05        0.01
Anger (N = 184)          Annoyed        0.44        0.10
                         Resentful      0.20        0.05
                         Angry          0.29        0.04
Fatigue (N = 203)        Fatigued       0.66        0.21
                         Worn           0.73        0.09
                         Exhausted      0.69        0.10
Vigor (N = 189)          Cheerful       0.56        0.32
                         Lively         0.76        0.12
                         Vigorous       0.46        0.23


Reliability estimates              Anxiety   Depression   Anger   Fatigue   Vigor
$\omega_{\text{median}}$           0.77      0.82         0.81    0.91      0.83
$\omega_{\text{mean}}$             0.71      0.77         0.77    0.89      0.81


Note. Each mood scale was analyzed separately. Results and reliability estimates reflect median values (unless otherwise
indicated).


ings. We can see that for Anxious mood, “on edge” is more related to daily variation in 
mood than “anxious.” It seems that this scale is dominated by nervousness rather than 
anxiety per se. For Depressed mood, daily feelings of hopelessness are much less central 
than feelings of being blue, sad, or discouraged. Variation in Anger is more characterized 
by annoyance than resentfulness or anger itself. The items for Fatigue have quite similar 
loadings, but daily variation in Vigor is related more to reports of feeling lively than to 
vigor itself. Although the factor analysis was carried out in the service of estimating 
reliability, note that the results that we just reviewed are also relevant to validity of the 
measures, to which we turn in the next section. 

We used the individual factor analysis (see Brose & Ram, Chapter 25, this volume) 
results to compute individual omega statistics. At the bottom of Table 17.5 we show both 
the median and mean values of these distributions. Median values for the reliability esti- 
mates are preferred over means because of a tendency for the distribution of reliabilities 
to be negatively skewed, with the skew primarily due to individuals with little or no vari- 
ability in their item scores. 

In general the $R_C$ estimates of reliability of change based on GT in Table 17.3 and the
omega estimates are very similar, differing in most cases by less than .01. However, the
omega median estimates tend to be slightly higher. In a simulation study Lane and Shrout
(2011) showed that the GT method underestimates the true reliability to the extent that 
the parallel items assumption (namely, equal item loadings) is violated. In this example, 
the GT method seems to be robust to the minor assumption violations, with the possible 
exception of Vigor, where the FA estimate was larger by .05. 

The methods we have reviewed in this section can help researchers improve measures 
by eliminating noise from the diary variables. Although improved reliability typically 
improves the utility or validity of measures, there is no guarantee that the signal refined 
in the measure is the one that will be the most useful in the empirical research. A thor- 
ough psychometric analysis requires special consideration of validity issues, and we now 
turn to procedures for evaluating validity. 


Evaluating Validity of Diary Methods 


We noted that one advantage of diary methods is that they often involve reporting behav- 
iors, feelings, or reactions that are immediate and transparent in meaning. The apparent 
connection of such reports to the constructs of interest reflects compelling face and con- 
tent validity of the reports. However, a proactive consideration of validity requires that 
the construct be clearly specified, and that its theoretical connections to related variables 
and unrelated variables be explicated. These are the usual steps of construct validation 
(Cronbach & Meehl, 1955) that have been laid out in a number of texts (e.g., Crocker & 
Algina, 1986, pp. 217-239). The main point we need to make here is that both between- 
person measurement variation and within-person variation validities need to be consid- 
ered separately, just as in our discussion of reliability. The validity of between-person 
summaries of diary data is likely to be as good as, or better than, standard cross-sectional 
measures, but the validity of within-person measures will depend on the variability of the 
scores over time and how easily the measurement concept can be interpreted in a daily 
context. For example, in the factor analysis of within-person variation discussed in the 
previous section, we found that hopelessness was not much related to other reports of 
depressed affect. This is likely due to the fact that our participants did not report much 
day-to-day variation in felt hopelessness. 

To illustrate the steps that we recommend for validation studies, we consider a mea- 
sure of anxiety within the context of the relationship. The questions are asked in a similar 
format to that for general POMS Anxiety, but the respondent is asked to focus on the 
context “of your relationship.” Work by mood researchers such as Schwarz and Clore 
(2007) suggests that specific contexts can change the meaning of mood. For this example, 
we are interested in (1) whether there is evidence that relationship anxiety is reliable for 
both individual differences and change, (2) whether relationship anxiety is related to gen- 
eral anxiety (convergent validity) but not redundant (discriminant validity) for both indi- 
vidual differences and change, and further (3) whether relationship anxiety is negatively 
correlated with daily reports of closeness. We might predict that relationship anxiety will 
have a stronger negative association with closeness than will general anxiety. 

We use the same sample of 209 individuals as in the previous analyses. In this exam- 
ple, however, we use only evening measurements because relationship anxiety and close- 
ness were not measured in the morning. Given that reliability is a necessary condition for 
validity, we first estimate the reliabilities of relationship anxiety and relationship closeness.⁴
Each subscale has two items: Relationship Anxiety (fearful, worried) and Relationship
Closeness (physically intimate, emotionally close).

The GT method was used to calculate both the between- and within-person reliability
estimates in this case. For relationship anxiety, reliability was estimated to be
excellent between persons ($R_{KF}$ = .99) and moderate within persons ($R_C$ = .64). Similarly,
relationship closeness had excellent between-person reliability ($R_{KF}$ = .99) and moderate
within-person reliability ($R_C$ = .70). The fact that both of these two-item measures display
reliability of change that is less than or equal to .70 underscores our earlier suggestion
that more than two items are typically needed to measure change with precision.

Table 17.6 shows the estimated correlations of average ratings between persons 
(below the diagonal) and the average within-person correlations (above the diagonal) for 
daily general anxiety, daily relationship anxiety, and daily closeness. We see evidence 
for both convergent and discriminant validity between the two different types of anxiety 
at both levels of analysis. At the between-person (individual-difference) level the two 
measures are correlated .59, which is large, but there is still substantial unshared variance 
between the two ratings. When evaluating this large correlation, keep in mind that the 
averages of reports across items and days have nearly perfect reliability. 
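
The distinction between the two kinds of correlation can be made concrete with a small Python sketch. The diaries below are entirely artificial (three hypothetical persons, four days, two generic measures) and are constructed so that the two levels disagree; the helper names are our own, and the within-person value is taken as a simple average of the person-specific correlations, which is an assumption about how such averages are formed.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Artificial diaries: person -> (daily scores on measure 1, measure 2).
DIARIES = {
    "A": ([1, 2, 3, 4], [4, 3, 2, 1]),
    "B": ([3, 4, 5, 6], [6, 5, 4, 3]),
    "C": ([5, 6, 7, 8], [8, 7, 6, 5]),
}

# Between-person level: correlate each person's averages over days.
between_r = pearson(
    [mean(m1) for m1, _ in DIARIES.values()],
    [mean(m2) for _, m2 in DIARIES.values()],
)

# Within-person level: correlate the daily scores separately for each
# person, then average the person-specific correlations.
within_r = mean([pearson(m1, m2) for m1, m2 in DIARIES.values()])

print(f"between-person r = {between_r:.2f}")
print(f"average within-person r = {within_r:.2f}")
```

In these constructed data, persons with higher averages on one measure also have higher averages on the other, yet within every person the two measures move in opposite directions from day to day, which is why validity evidence has to be examined at each level separately.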

At the within-person (change) level there is somewhat less overlap between daily 
reports of general anxiety and daily reports of relationship anxiety. It seems likely that 
there are many days when persons report being anxious both generally and in their rela- 
tionship, but there are also days when persons report feeling anxious generally but quite 
relaxed in the relationship. 

When we consider the different conceptual domain of relationship closeness, we
again observe correlations in the expected direction. Persons who on average report more
daily general anxiety also have less average daily closeness (between-person r = -.20),
and this correlation is similar for averaged daily reports of relationship anxiety
(between-person r = -.23). Interestingly, the within-person correlations are very similar in
magnitude to the between-person correlations. Regardless of the average level of closeness
or anxiety, there was a tendency for people’s relationship closeness reports to be lower
on days with elevated general anxiety (within-person r = -.19) and relationship anxiety
(within-person r = -.22). The fact that these correlations are very close even though the
daily measures of anxiety were modestly correlated (within-person r = .35) suggests that
each kind of anxiety might have a unique association to the assessment of closeness.

This example illustrates the fact that validity of measures is rarely established abso- 
lutely. Construct validity is best viewed as a “work in progress,” with empirical results 


TABLE 17.6. Between-Person and Within-Person Average Correlations 
for General Anxiety, Relationship Anxiety, and Relationship Closeness 


                               1       2       3
1. General anxiety           1.00     .35    -.19
2. Relationship anxiety       .59    1.00    -.22
3. Relationship closeness    -.20    -.23    1.00


Note. Lower half of the table depicts between-person average correlations. Upper half depicts within- 
person average correlations. 
helping to refine the measures, and better measures helping to clarify the constructs. We
encourage diary researchers to consider validity of measures in every study and to avoid 
the mistake of concluding that validity is a binary (present, absent) attribute of a mea- 
sure. The utility and meaning of measures often vary from population to population, and 
across social and temporal contexts. 


Conclusions, Extensions, and Further Developments 


Reviewers of articles submitted to top-tier journals routinely ask authors to document 
that the measures used are reliable and display some evidence of validity, but when data 
are collected using diary methods, both reviewers and authors have been confused about 
the appropriate psychometric evidence to present. For example, authors have sometimes 
reported Cronbach’s alpha for diary methods when the primary analysis is a within- 
person analysis of change. We hope that authors and reviewers in the future make a 
clear distinction between the psychometric evidence that is needed when diary results are 
summarized for cross-sectional between-person analyses, and the evidence needed when 
process and change are the central foci of the diary study. 

The psychometrics of measures used in diary studies should always be revisited when 
the study population is different from those used in the past, and when new combinations 
of variables are considered. Both the reliability and validity information about diary mea- 
sures depend on there being sufficient variation. Although there is considerable informa- 
tion in the literature about between-person variation of important constructs, there is less 
information about within-person variation. If an investigator of processes such as attach- 
ment, self-esteem, diligence, and so on, cannot document that there is within-person 
variation in an initial study, it will not be possible to establish the reliability or validity of 
measures. However, rather than concluding in such cases that the measures are flawed, it 
is often worth considering that the reliability and validity of the measures are essentially 
untested. Without variability, there can be no measurement error (since it would create 
spurious variation), but there can also be no evidence of reliability or validity. 

The GT approach of Cranford and colleagues (2006) provides a good first-order 
approximation of the reliability of assessments in diary methods, but sophisticated 
authors should consider the refinements provided by Lane and Shrout (2011) reviewed 
earlier. Although the GT approach makes assumptions that are likely to be violated, the 
implication of these violations is not necessarily enough to invalidate the results of the 
approach. For example, the assumptions of constant variance and equal relevance of 
specific items to the final score are precisely the assumptions that lead many methodolo- 
gists to complain about Cronbach’s alpha analyses of reliability in cross-sectional studies 
(e.g., Sijtsma, 2009). However, when basic procedures for scale development are used and 
items generally assess the same construct, the bias in alpha is often small. 

In addition to the Cranford and colleagues (2006) and Lane and Shrout (2011) 
approaches, there are other, emerging methods for analyzing the measurement quality 
of diary data. Wilhelm and Schoebi (2007) proposed a psychometric model based on a 
multilevel perspective, and their method makes fewer assumptions than the GT approach 
of Cranford and colleagues, but slightly more assumptions than the FA approach of Lane 
and Shrout. As we mentioned earlier, Nezlek and Gable (2001) recommended a multilevel 
model that is appropriate for diary designs, in which the time points and items are strictly 
nested. The Cranford and colleagues and Wilhelm and Schoebi models assume that items 
and times are crossed with persons, and the Lane and Shrout model only assumes that 
items are crossed with persons. Investigators should choose the reliability coefficient that 
matches the underlying structure of the data under their design. 

Space limitations have prevented us from addressing a number of technical issues. 
One is the impact of missing data on psychometric analyses. It is often possible to obtain 
unbiased estimates of the variance components of the psychometric model when there 
are missing data, but the missingness pattern must be what Little and Rubin (1987) call 
“missing completely at random,” or “missing at random.” See Fitzmaurice, Ware, and 
Laird (2004) for a discussion of these distinctions in the context of multilevel models. 
However, if different persons have different numbers of days or items, then the reliability 
of the measurements will vary by respondent. In circumstances such as this, we recom- 
mend that investigators report the range of reliability values assuming more or fewer 
replicate items and diary days. 
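
One way to report such a range is to recompute the coefficients for the smallest and largest numbers of items and days observed. The following is a minimal sketch in Python, not part of the original analyses. It assumes the between-person and within-person formulas take the form used in Note 2 and borrows the variance components quoted there; the function and argument names are hypothetical.

```python
def between_person_reliability(var_person, var_person_item, var_error, n_items, n_days):
    """Between-person reliability of a mean over n_days days of n_items items (form used in Note 2)."""
    true_var = var_person + var_person_item / n_items
    return true_var / (true_var + var_error / (n_items * n_days))


def within_person_reliability(var_person_day, var_error, n_items):
    """Within-person (change) reliability for n_items items (form used in Note 2)."""
    return var_person_day / (var_person_day + var_error / n_items)


# Variance components quoted in Note 2 (illustrative values only).
VAR_PERSON, VAR_PERSON_ITEM, VAR_PERSON_DAY, VAR_ERROR = 2.343, 0.121, 0.919, 0.958

# Tabulate reliability for fewer or more items and diary days (assumed ranges).
for n_items in (2, 3, 4):
    for n_days in (2, 4, 7):
        r_between = between_person_reliability(
            VAR_PERSON, VAR_PERSON_ITEM, VAR_ERROR, n_items, n_days)
        r_within = within_person_reliability(VAR_PERSON_DAY, VAR_ERROR, n_items)
        print(f"items={n_items} days={n_days} between={r_between:.2f} within={r_within:.2f}")
```

With three items and four days, the two functions reproduce the values given in Note 2; the other rows show how the coefficients move when respondents contribute fewer or more items and days.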

We also note that the methods we have reviewed for reliability are appropriate only 
for quantitative ratings, but often diaries ask respondents to make categorical or ordered 
reports. With binary data the GT assumptions of homogeneous variances are certainly 
violated, but the application of the method will give an initial estimate of the measure- 
ment quality. The FA methods should not be used with binary items. New methods that 
extend item response models (e.g., Embretson & Reise, 2000) to intensive repeated mea- 
sures analyses are in their infancy (see McArdle, Grimm, Hamagami, Bowles, & Mer- 
edith, 2009; Meiser, 1996; Rijmen, De Boeck, & Van der Maas, 2005), but will be more 
available in the future. 

Fortunately, the methods for testing hypotheses about the validity of measures are 
much more accessible. Multilevel models for both quantitative and categorical data are 
becoming well known to diary researchers, and software for sophisticated analysis of 
trajectories and dynamic change is constantly being updated to include the latest meth- 
odological advances. We hope that these methods will be applied in the service of testing 
the quality of diary reports and constructs. This is done automatically when researchers 
report patterns of results that follow predictions. When results are surprising or show no 
clear pattern, there is an opportunity to think critically about the measures of process 
and change. 


Notes 


1. In Equations 1 and 3, the population mean was implicitly included in the between-person score, P, 
but it will be useful in Equation 5 to consider all of the subsequent terms to be deviations from the 
mean. 

2. For the simple numerical example presented in Table 17.1, with variance components listed in Table 
17.2, both the ANOVA and MIXED methods give estimates of R_KF = [2.343 + (.121/3)]/[2.343 + 
(.121/3) + (.958/(3*4))] = .97 for the between-person reliability and R_C = .919/(.919 + .958/3) = .74 for 
the within-person reliability. 

3. By applying Equation 9 with three items, we get .195/(.195 + .188/3) = .76. If we have four items, the 
equation becomes .195/(.195 + .188/4) = .81. 

4. We have already estimated the reliabilities for general anxiety. 

5. Between-person correlations were calculated by first taking each individual’s average ratings across 
the time period for each of the three measures, then correlating those average ratings across the 
sample. Conversely, within-person correlations were calculated by first correlating each pair of measures 
across each individual’s time series, then taking their average. 
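
The two calculations in Note 5 can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the data layout (person, then measure, then daily scores), the toy values, and all names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either has no variance)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def between_person_r(data, var_a, var_b):
    """Average each person's ratings over time, then correlate the averages across persons."""
    means_a = [mean(days[var_a]) for days in data.values()]
    means_b = [mean(days[var_b]) for days in data.values()]
    return pearson(means_a, means_b)


def within_person_r(data, var_a, var_b):
    """Correlate the two measures within each person's time series, then average the correlations."""
    return mean(pearson(days[var_a], days[var_b]) for days in data.values())


# Hypothetical toy data: person -> measure -> daily scores.
diary = {
    "p1": {"anxiety": [1, 2, 3], "closeness": [5, 4, 3]},
    "p2": {"anxiety": [2, 3, 4], "closeness": [6, 5, 4]},
    "p3": {"anxiety": [4, 5, 6], "closeness": [7, 6, 5]},
}

print(between_person_r(diary, "anxiety", "closeness"))
print(within_person_r(diary, "anxiety", "closeness"))
```

In the toy data the between-person correlation is positive while every within-person correlation is negative, which illustrates why the two halves of a table such as Table 17.6 need not agree.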
CHAPTER 18 


A Guide for Data Cleaning 
in Experience Sampling Studies 


KIRA O. MCCABE 
LORI MACK 
WILLIAM FLEESON 


The purpose of this chapter is to recommend strategies for preventing and identifying 
problems in personal digital assistant (PDA) data. We provide a guide for data cleaning 
procedures, which may promote and facilitate PDA usage in studies that employ experience 
sampling methodology (ESM). Other chapters in this volume focus on preparing to run 
an ESM study, on running analyses after the study, and on conceptual questions that 
can be tackled with ESM. However, cleaning the data, arguably the most difficult 
component of ESM studies, also needs consideration. The literature lacks a clear discussion 
and standardization of how to detect errors in PDA data and how to prepare the data 
for analysis. 

It is important to clean PDA data properly for at least three reasons. First, careful 
cleaning ensures that PDA studies adhere to the rigorous standardization that is funda- 
mental to psychological research. PDA studies are more technologically sophisticated 
than alternative data collection methods, but the technology does not protect the data 
from error. In fact, the use of PDAs even produces some new kinds of error. Accordingly, 
PDA studies need a set of standards that is commensurate with the scientific rigor we 
require of other methodologies. Second, proper data cleaning enhances the reliability, 
validity, and power of the data. One of the great strengths of ESM is its ecological valid- 
ity, but that validity can be muddled by noncompliance in the data. If the noncompliance 
can be addressed easily—and we suggest it can—the overall validity and power of this 
method can be improved. Third, data cleaning prevents data errors from being com- 
pounded and exacerbated when conducting analyses, particularly analyses of individual 
differences in within-person effects. Errors in the data can mask individual differences 
in within-person effects, and can, conversely, masquerade as individual differences in 
within-person effects. 

We are assuming that readers are familiar with ESM and other daily measures (Con- 
ner & Lehman, Chapter 5, this volume). However, for completeness, by ESM we denote 
a method that allows for repeated measurement of behavior states within the context of 
individuals’ everyday lives. The cleaning step is sometimes the most difficult step in the 
whole process of conducting an ESM study, but researchers should not allow complexity 
of the data to prevent them from upholding rigorous scientific standards. In this chapter, 
we present several problems we have found when cleaning our own data, along with our 
recommendations for identifying and solving these problems. 

When cleaning PDA data, some problems are similar to those of traditional studies. 
However, because PDA data are not manually entered, those problems may not be caught 
as readily. For example, a paper questionnaire with the same response for all 100 items 
could easily be caught and removed during data analysis. Some unique problems with 
PDA data include participants not completing reports on time. For example, a participant 
carrying a PDA could complete a questionnaire whenever he or she wanted—even at 3:00 
A.M.—despite the researcher’s instructions. 

Cleaning ESM data requires a general investigative attitude, because problems in the 
data can be hard to find among the large number of reports for each participant. Our goal 
in this chapter is to aid the researcher in discovering the problems. After discovering 
them, the researcher needs to use the research notes and information to try to solve the 
identified problems, fix the data, and record what was done. The techniques in this chapter 
should uncover about 90% of the problems in a given study, but unique conditions in each 
study mean that researchers have to apply the techniques in a thoughtful manner to 
complete all the necessary cleaning. 

Our approach in this chapter presumes use of certain kinds of technology; specifi- 
cally, we use Palm Z22 PDAs with the Experience Sampling Program (ESP; Barrett & 
Feldman Barrett, 2000), which interfaces with the Palm OS platform. However, the tech- 
niques are generalizable across technologies, with adaptation. 


Cleaning in Advance: Preventing Problems in the Data 


The majority of cleaning steps occur after data collection; however, some preventive mea- 
sures can be taken before and during data collection. 


PDA Software and Study Preparation 
EXPERIENCE SAMPLING SOFTWARE 


Kubiak and Krog (Chapter 7, this volume) compare the primary experience sampling 
software packages. We reference the ESP version 4.0 (Barrett & Feldman Barrett, 2000). 
However, most of these cleaning steps can be adapted to other experience sampling soft- 
ware. Within ESP we opt to use the “take over the machine” mode, with a separate alarm 
program called LookAtMe Version 1.4 (Ezell, 1998), which restarts ESP at the beginning 
of the report after each alarm. 


PARTICIPANT IDENTIFICATION 


It is important to assign each PDA a unique identification number. Within ESP, down- 
loaded files are saved with this participant ID. We write the participant ID on a per- 
manent sticker on the back of the PDA for quick reference and use this participant ID 
number on any paper questionnaires. If the same PDA is used by multiple participants 
in one study, then the researcher gives each participant a unique ID (perhaps building 
on the PDA ID), uses the unique ID on the paper questionnaires, makes sure to change 
the folder and file names immediately upon downloading data to the correct ID number, 
and changes the value of the ID to the correct ID in the Statistical Package for the Social 
Sciences (SPSS) file when it is first created for the participant (using a compute or recode 
command). In such cases it is also worth having a question in which participants click on 
their unique IDs (as the only option). It is best to keep the participant ID number on the 
PDA constant to track PDA performance across multiple studies and for ease of dealing 
with the software. 


QUESTIONS 


When writing questions, it is important to conduct multiple checks of the question for- 
matting—both before and after the questions are on the PDA. 


SETTING AND PROTECTING THE CLOCK 


ESP generates a time stamp at the beginning of each report from the PDA’s clock; if that 
clock is set manually, and an experimenter is preparing dozens of PDAs, the experimenter 
may set the date and time incorrectly. A false time stamp may lead to incorrect clean- 
ing later. As a precaution, we have opted to use date and time confirmation questions to 
check against the time stamp—one question asks the date (date confirm), and the other 
asks the time (time confirm). This might be less important if the PDA is programmed to 
allow responses at only specific times, but we still recommend it, because it can solve 
problems later, for example, if the PDA clock is wrong, if the battery dies, if the partici- 
pant is able to access the machine at other times, or if other data accidentally get mixed 
in with the text files. If the ESM study runs for over 2 weeks, then the number of dates 
might exceed the number of response options ESP allows. In this scenario, we have 
either used multiple date questions or changed the date confirm question by uploading 
new questions when participants come to download data during the study. We also add a 
“practice” response option to the time confirm question. 

Another concern of a manually set clock is that the participant may tamper with the 
clock in order to fabricate his or her data. In our studies, participants seldom override the 
“take over the machine” mode, and they rarely change the clock. However, researchers 
should be aware of this important concern. Some Palm security programs can be used to 
password-protect this path and prevent participant access to other programs. We have 
been using SecureX Version 1.3 (2006), which password-protects programs the experimenter 
does not want the participant to access. (We purchased a customized version of 
SecureX that allows participants to access ESP but no other programs for the full duration 
of the study.) 

Although there are advantages and disadvantages to using a fixed schedule (Reis & 
Gable, 2000; Scollon, Kim-Prieto, & Diener, 2003; see also Conner & Lehman, Chap- 
ter 5, this volume), we prefer a fixed schedule for at least two reasons. First, we know 
exactly when a report should be completed and when it should not be completed, which 
is extremely helpful in the cleaning process. Second, a fixed schedule decreases the burden 
on participants. 


ALARMS 


If a participant is answering questions on a report in ESP when the alarm sounds, ESP 
might stop the report in progress and restart at the beginning of a new report, creating 
an incomplete report and two reports for the same time. A possible solution is to have 
the alarm sound 5 minutes earlier than the instructed time (e.g., 11:55 A.M. for a noon 
report). 


Preventing Problems during a PDA Study 


We usually have participants come by the laboratory at least twice during a 2-week study 
to back up their data, and we keep track of any problems that arise. Careful, precise doc- 
umentation during a PDA study can be extremely beneficial for the later cleaning process. 
One issue with the PDA itself is making sure the battery lasts throughout the study. For 
the Palm Z22, if the battery is depleted for too long, the clock on the PDA might reset. 
An option is to give participants the chargers and instruct them to charge the PDA for an 
hour or so on specific days to ensure that the battery will last for the study duration (but 
beware of overcharging). 

If a participant leaves a time zone during the study (e.g., for a short business trip), 
then we ask him or her to complete the reports on local time. It is essential to document 
the exact times and dates of the trip to correct the time stamps during cleaning; we advise 
that researchers ask participants about any time zone changes at both the beginning and 
the end of the study. The researcher may also want to ask whether roommates or spouses 
are participating in the study together. In one of our past studies, a participant picked 
up another’s PDA by mistake and completed reports on the wrong PDA. In these cases, 
different-colored stickers may be placed on the back of the PDAs for quick and easy 
identification. 


Identifying Problems: Text File Cleaning 


Text File Cleaning Preparation 


For instructions on how to download the data from the PDA, consult the experiment’s 
software user manual (Barrett & Feldman Barrett, 2000). The first preparatory cleaning 
step is to make a backup of the original, unaltered data files. Second, one makes a new 
folder in which to store the SPSS files as one works on cleaning the text files. Third, one 
creates a document to keep track of any cleaning conducted in the text files, and records 
the date and completion time of each of the following steps as they are completed. Good 
record keeping during data cleaning makes it much easier to correct any problems that 
may arise. Problems are indeed likely to arise, given the complexity of the data collection, 
and most likely are not cause for alarm. 


Transferring the Text Files to SPSS 


The researcher must first transfer the data from the text file into an SPSS data file. There 
is one text file after each download, so participants have one file if they download the 
data once, or three files if they download the data three times. For this reason, we rename 
all files with the participant ID number and a letter in alphabetical order for each file 
(e.g., 449a, 449b, and 449c). 

Before one runs the syntax, some cleaning of the text files should take place. To 
start, one opens the first text file and changes the cumbersome ESP file name to a new 
file name, such as “449a.” The text file is scanned to make sure the data look like Table 
18.1. Each line in the text file represents a single question, and blocks of lines represent 
a single report. The time stamp in Column A represents the date and time at the beginning 
of a report, including the year, month, day, hour, minute, and second (in that order). 
The response time in Column D is recorded in ticks (hundredths of a second). Columns E 
and F are responses: The participant saw on the PDA screen the label listed in Column 
F, and ESP records in Column E the numerical order of the response (the first possible 
response as a 1, the second possible response as a 2, etc.). Most of the time these match 
up, but they will not in cases in which the response labels are not numbered from 1 to 
the highest response. Rather than transferring the data as string variables into SPSS, 
we choose to transfer only the numerical values in Column E and add the labels from 
Column F in SPSS later. 
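
For readers who want to inspect a text file outside SPSS, the column layout can be sketched in Python as below. This is an assumption-laden illustration, not part of the ESP software: it assumes the fields are separated by whitespace in the order given in the note to Table 18.1 (including the two ignored columns before the label), whereas a real export may be fixed-width and need column slicing instead. The function name and dictionary keys are hypothetical.

```python
from datetime import datetime


def parse_report_line(line):
    """Split one text-file line (one question) into named fields.

    Assumed field order: time stamp, participant ID, question number,
    response time in ticks, response value, two ignored columns, label.
    """
    stamp, pid, question, ticks, response, _skip1, _skip2, label = line.split(None, 7)
    return {
        "timestamp": datetime.strptime(stamp, "%Y%m%d%H%M%S"),
        "participant_id": int(pid),
        "question": int(question),
        "response_seconds": int(ticks) / 100,  # ticks are hundredths of a second
        "response": int(response),
        "label": label.strip().strip('"“”'),  # drop straight or curly quotation marks
    }


row = parse_report_line('20090315150433 449 1 546 2 1 0 "3PM"')
print(row["timestamp"], row["response"], row["label"])
```

Only the numeric response is needed for the SPSS transfer; the label is kept here simply to make visual checks of a file easier.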


STEP-BY-STEP SYNTAX EXPLANATION 


Table 18.2 contains the complete syntax to transfer the data from the text files to SPSS. 
After each text file is cleaned, open the syntax and change the “data list file” command, so 
the file route is directed to the text file just cleaned. For all syntax files, the researcher needs 
to change some specific details to correspond to the specific study. These details include 
file paths and names, variable names (variable columns in some cases), numbers of questions, 
participant IDs, changes due to version of SPSS, and other changes. Additionally, 
for repetitive syntax, we provide only one or two examples. We assume that the researcher 
has familiarity with SPSS and can make the necessary changes and extensions. We use the 
data list file command to access the text file, and the records command to create the SPSS 
variables. The “casestovars” command restructures the dataset, changing cases (lines) into 


TABLE 18.1. Sample Text File Format 

A                 B      C     D       E               F 

20090315150433    449    1     546     2     1    0    “3PM” 
20090315150433    449    2     271     11    1    0    “Mar15” 
20090315150433    449    3     749     2     1    0    “2” 
20090315150433    449    4     360     3     1    0    “3” 
20090315150433    449    5     1227    2     1    0    “2” 
20090315150433    449    6     170     2     1    0    “2” 
20090315150433    449    7     218     3     1    0    “3” 
20090315150433    449    8     227     2     1    0    “2” 
20090315150433    449    9     517     5     1    0    “5” 


Note. Each column shows a different part of the data that is important to understanding 
the text file. Note that the two columns with “1” and “0” between Column E and Column 
F are not important and should be ignored. The columns give the following information: (A) 
time stamp (taken from the beginning of the report), (B) participant ID number, (C) question 
number, (D) response time (in hundredths of a second), (E) the value of the response, and (F) 
the label of the response. 
TABLE 18.2. Sample Syntax for Entering Text Files into SPSS: Nonrandomized 
and Randomized Questions 


Nonrandomized questions 

* Read one text file; each line holds one question.
DATA LIST FILE='C:/experiment/subject data/449/449a.txt'
  RECORDS=1
  /1 Year 1-4 Month 5-6 Day 7-8 Hour 9-10 Minute 11-12 Second 13-14
     PalmID 22-24 Question 25-35 ResponseTime 36-49 Response 55-59.
EXECUTE.

* Restructure so that each report (one time stamp) becomes one case.
CASESTOVARS
  /ID=Year Month Day Hour Minute Second PalmID
  /GROUPBY=INDEX.

* Append the saved participant files into one dataset.
ADD FILES FILE='C:/experiment/subject data/403/403.sav'
  /FILE='C:/experiment/subject data/449/449.sav'
  /FILE='C:/experiment/subject data/492/492.sav'.
EXECUTE.


Randomized questions 

* Read one text file; each line holds one question.
DATA LIST FILE='C:/experiment/subject data/449/449a.txt'
  RECORDS=1
  /1 Year 1-4 Month 5-6 Day 7-8 Hour 9-10 Minute 11-12 Second 13-14
     PalmID 22-24 Question 25-35 ResponseTime 36-49 Response 55-59.
EXECUTE.

* Renumber questions so that the nonrandomized (negative) numbers come first.
RECODE Question (-10=1) (-9=2) (1=3) (2=4) (3=5) (4=6) (5=7) (6=8) (7=9) (8=10) (9=11) (10=12)
  (11=13) (12=14) (13=15) (14=16) (15=17) (16=18) (17=19) (18=20) (19=21) (20=22) (21=23) (22=24)
  (23=25) (24=26) (25=27) (26=28) (27=29) (28=30) (29=31) (30=32) (31=33) (32=34) (33=35).
EXECUTE.

* Put the questions of each report in sequential order before restructuring.
SORT CASES BY Year (A) Month (A) Day (A) Hour (A) Minute (A) Second (A) Question (A).

CASESTOVARS
  /ID=Year Month Day Hour Minute Second PalmID
  /GROUPBY=INDEX.

* Append the saved participant files into one dataset.
ADD FILES FILE='C:/experiment/subject data/403/403.sav'
  /FILE='C:/experiment/subject data/449/449.sav'
  /FILE='C:/experiment/subject data/492/492.sav'.
EXECUTE.


Note. This table is the complete syntax used to transfer data into SPSS. The characters should be correct for ESP text 
files; however, double-checking is important to prevent any errors, and some syntax will need to be changed for the 
specific details of each study. The text inside the quotation marks is a sample file route that should be changed for 
individual use. 


variables.¹ Instead of having one question per line, as the text file does, this command 
restructures the dataset so that each report becomes one case in SPSS. In the new dataset, 
the “casestovars” command matches the cases with the same time stamp and creates a new, 
single, compiled case with all the responses and response times. After running this syntax, 
the file should name the response variables “Response.1,” “ResponseTime.1,” and so on. 
If multiple participants used the same PDA, add a command (e.g., compute) to this syntax 
to change the Palm ID to the correct participant unique ID. 
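
The restructuring just described can be pictured outside SPSS with a short Python sketch. It is an illustration of the long-to-wide idea only, not a reimplementation of the SPSS command: lines that share a participant ID and time stamp are collected into one record, and variables are numbered by the order in which the lines appear. The dictionary keys and function name are hypothetical.

```python
from collections import defaultdict


def cases_to_vars(rows):
    """Collect question-level rows into one wide record per report.

    A report is identified by (participant_id, timestamp). Variables are
    numbered by order of appearance within the report.
    """
    reports = {}
    counts = defaultdict(int)
    for row in rows:
        key = (row["participant_id"], row["timestamp"])
        counts[key] += 1
        n = counts[key]
        report = reports.setdefault(
            key, {"participant_id": key[0], "timestamp": key[1]})
        report[f"Question.{n}"] = row["question"]
        report[f"ResponseTime.{n}"] = row["response_time"]
        report[f"Response.{n}"] = row["response"]
    return list(reports.values())


# Hypothetical long-format rows: three questions of one report, one of another.
long_rows = [
    {"participant_id": 449, "timestamp": "20090315150433", "question": 1, "response_time": 546, "response": 2},
    {"participant_id": 449, "timestamp": "20090315150433", "question": 2, "response_time": 271, "response": 11},
    {"participant_id": 449, "timestamp": "20090315150433", "question": 3, "response_time": 749, "response": 2},
    {"participant_id": 449, "timestamp": "20090315180212", "question": 1, "response_time": 402, "response": 3},
]

wide = cases_to_vars(long_rows)
print(len(wide), wide[0]["Response.2"])
```

The second report in the example has only one question, so it produces a record with fewer variables, which is how a partial report shows up after restructuring.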
This syntax will enter the data into SPSS correctly even if there are partial reports. 
Because the response variables have generic names when the “casestovars” command 
restructures the dataset, these variable names should be changed to something more meaningful 
prior to running analyses (after concatenating all participants). One limitation is that all 
variables must have the same qualities; in other words, all variables can be numeric and 
a certain character length, or all variables can be string variables. 


ADAPTING THE SYNTAX FOR RANDOM QUESTION ORDER 


Adding a few commands to the syntax (bottom of Table 18.2) makes it possible to put 
randomized questions back in order for analysis. Before running the casestovars command, 
include a sort cases command by time stamp and question number, as in Table 18.2, to put 
the questions in sequential order. If some questions are randomized and others are not 
(e.g., if the date confirm and time confirm questions are at the beginning of every report), 
then an additional step is necessary. The nonrandomized question numbers are negative 
numbers, and they must be recoded to positive numbers (and all other question numbers 
adjusted) for the casestovars command to work. 
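
The renumbering rule behind the recode in Table 18.2 can be stated compactly: the nonrandomized questions take the first positions, and every randomized question number is shifted up by the number of nonrandomized questions. The Python sketch below illustrates that rule under the assumption, taken from Table 18.2, of two nonrandomized questions numbered -10 and -9 followed by 33 randomized questions; the function name is hypothetical.

```python
def build_question_recode(fixed_numbers, n_randomized):
    """Map raw question numbers to consecutive positive positions.

    fixed_numbers: the negative numbers given to nonrandomized questions.
    n_randomized: how many randomized questions follow them.
    """
    mapping = {raw: pos for pos, raw in enumerate(sorted(fixed_numbers), start=1)}
    offset = len(mapping)
    for q in range(1, n_randomized + 1):
        mapping[q] = q + offset
    return mapping


recode = build_question_recode([-10, -9], 33)

# Lines of one report in the (random) order in which they were answered.
rows = [{"question": 7}, {"question": -9}, {"question": 1}, {"question": -10}]
for row in rows:
    row["question"] = recode[row["question"]]
rows.sort(key=lambda r: r["question"])  # sort before restructuring

print([r["question"] for r in rows])
```

Generating the mapping from the study design, rather than typing each pair, reduces the chance of a skipped or duplicated value in a long recode list.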

After running the syntax, check the final question number variable (e.g., “Question.30” 
for a 30-question report) to make sure that its value always equals the number in its name; 
otherwise the entire file is probably aligned incorrectly. The first time the syntax is used, 
scan the entire series of reports carefully to check that each variable is present, and that 
each value makes sense. Conduct descriptive statistics (means, minimum values, and 
maximum values) for all the variables, and check to see that they are reasonable. Variables 
named “Question.#” should have a standard deviation of 0 and a mean equal to the # in 
the variable name. For example, all values of “Question.9” should be 9. Note that with 
randomized question orders, partial reports may produce occasional inappropriate values 
here; this is acceptable if they are few. If problems are detected, then examine the text file 
using the date of the troubling report as a guide, and try to uncover and fix the error in 
the text file. Note all problems and solutions in the cleaning document. After checking the 
first data file thoroughly, the researcher should only need to check the first and last 
question for alignment for the remainder of the files. 
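
The alignment check amounts to confirming that every question number variable holds only its own number. A minimal Python sketch of that check follows; it assumes the wide layout with variables named “Question.1,” “Question.2,” and so on, and treats a missing variable as belonging to a partial report rather than as an error. The function name is hypothetical.

```python
def misaligned_question_columns(reports, n_questions):
    """Return {question_number: unexpected_values} for Question.# variables.

    An empty result means every Question.# variable contains only # (or is
    missing, as happens in partial reports).
    """
    problems = {}
    for n in range(1, n_questions + 1):
        values = {report.get(f"Question.{n}") for report in reports}
        unexpected = values - {n, None}  # None marks a missing question
        if unexpected:
            problems[n] = unexpected
    return problems


# Hypothetical reports: the second is misaligned, the third is partial.
reports = [
    {"Question.1": 1, "Question.2": 2},
    {"Question.1": 1, "Question.2": 3},
    {"Question.1": 1},
]
print(misaligned_question_columns(reports, 2))
```

Any question number listed in the result points to the reports whose text-file lines should be examined by hand.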

Remove the introductory “practice” report(s) for each participant at the beginning 
of the file. Save the new SPSS file into the new SPSS storage folder. Continue the process 
with the rest of the text files. Once all the individual text files are transformed into SPSS, 
append all the SPSS files together with the add files command (shown in Table 18.2 for 
two participants). The resulting dataset should be all reports from all participants. Save a 
backup and conduct descriptives to check that the entire dataset is aligned correctly and 
that the approximate number of reports is correct (N × number of reports per participant) 
before continuing the cleaning process. Save the syntax file and open a new one. 


Identifying Problems: SPSS Cleaning 


SPSS Cleaning Preparation 


After creating a merged data file, make a date variable that combines the month, day, 
and year variables, as well as a time variable that combines the hour, minute, and second 
variables, using the date and time wizard. These combined variables are string variables, 
but they make the dataset easier to read and to sort (move these variables to the front of 
the file). Enter value labels for variables as necessary, such as for the time confirm variable 
(e.g., label response option 2 as “3:00 P.M.”). Change values for any variables in which the 
numbers on the PDA screen did not match the numbers recorded in the data—for exam- 
ple, if the response options were 0-5, they will have been recorded in the data as 1-6. In 
such a case, use the compute command to subtract one from each value of the variable. 

To prepare for cleaning in SPSS, make new dichotomous variables for all problems 
and notes in the dataset. Problems are critical issues within the data that make the report 
invalid, such as response times that indicate a report was completed too quickly to con- 
tain real answers. Notes are things to document for a report that may or may not be valid 
and need to be evaluated in more depth. For all reports, we use “1” if the problem or note 
applies to that report and “0” if it does not. Table 18.3 contains a list of common notes, 
and Table 18.4 on page 331 contains a list of common problems. 


Notes to Mark during SPSS Cleaning 


There are six common notes that should be investigated in the SPSS cleaning. Although 
none of these notes should automatically make a report invalid, they all warrant closer 
inspection. 


TABLE 18.3. Common Notes and Syntax 

Note                               Syntax 

Partial reports                    Recode Question.30 (Missing = 7777). 
(complete syntax)                  If Question.30 = 30 PartialReportNote = 0. 
                                   If Question.30 = 7777 PartialReportNote = 1. 
                                   Execute. 

Participant dropped from study     Compute DroppedfromStudy = 0. 
(complete syntax)                  If PalmID = 402 DroppedfromStudy = 1. 
                                   Execute. 

Time zone changes                  Compute timezonechange = 0. 
(complete syntax)                  Execute. 
                                   Enter manually (timezonechange = 1) 

Timeconfirm variable does not      Rename Variables (Response.1=timeconfirm). 
match time stamp                   Execute. 
(sample syntax)                    Do if (Hour LT 6). 
                                     Compute Hour = Hour+24. 
                                     Compute Day = Day-1. 
                                     If Day=0 Day=28. 
                                   End if. 
                                   Execute. 
                                   COMPUTE timemismatch=0. 
                                   IF timeconfirm=1 AND ((Hour=9 AND Minute<55) OR (Hour GT 10) 
                                     OR (Hour LT 9)) timemismatch=1. 
                                   EXECUTE. 


Note. The syntax listed here can be used in SPSS to check for these notes. In the complete syntax, the full syntax 
listed here can be used with minor alterations (e.g., changing the last question number for the partial reports or the 
participant ID number for anyone who may have dropped from the study). The sample syntax shows all steps in the 
process, but additions are needed. For example, the time confirm variable needs the “if” commands for all times 
listed in the study. 


PARTIAL REPORTS 


Partial reports can be the result of a PDA freeze, an early start to a report that was 
interrupted by the alarm, or a number of other interruptions. They may or may not be 
removed from analysis, but they should at least be noted early in the cleaning. Partial 
reports can be documented by using the syntax in Table 18.3, assuming there are 30 
questions per report (adjust otherwise). A way to check the accuracy of this syntax is 
to sort the file by the variable for the highest question number, which should show the 
recoded value for all partial reports marked in the note. (Whenever sorting, save first 
and return to the saved file. To return to the base order, use the sort syntax and sort by 
ID, date, and time, in that order.) The researcher must decide whether or not to discard 
partial reports. In our judgment, partial reports are valid so long as they do not qualify 
for another problem. Identify partial reports early in the cleaning process to help identify 
and clean other problems. 
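
The partial-report note can also be sketched outside SPSS. The Python illustration below mirrors the logic of the Table 18.3 syntax under the same assumption of 30 questions per report, with one difference that is labeled in the code: any report whose last question variable is not equal to its number is flagged, not only those where it is missing. The function name is hypothetical.

```python
def flag_partial_reports(reports, last_question=30):
    """Add PartialReportNote (1 = partial, 0 = complete) to each report.

    A report counts as complete only if its last Question.# variable equals
    its own number. Unlike the SPSS sample, a wrong (misaligned) value is
    also flagged so that it gets inspected.
    """
    name = f"Question.{last_question}"
    for report in reports:
        report["PartialReportNote"] = 0 if report.get(name) == last_question else 1
    return reports


# Hypothetical reports: one complete, one that stopped before question 30.
reports = flag_partial_reports([{"Question.30": 30}, {"Question.12": 12}])
print([r["PartialReportNote"] for r in reports])
```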


PARTICIPANTS WHO DROPPED OUT OF THE STUDY 


Mark a note for all the reports of any participant who dropped out of the study by using 
syntax such as that in Table 18.3. As with partial reports, this issue is of note because 
the researcher must decide what to do with the completed reports. 


TIME ZONE CHANGES 


If a participant travels to another time zone during the course of the study, the time stamp 
entries must be changed after downloading the data. Find the study documentation that 
explains when the participant was out of town and what the time zone difference was, 
and find the participant’s reports for that time span. Mark these reports with the time 
zone note. Then, manually change the time (and date, if necessary) to match the time 
zone of the other reports in the study. This cleaning step prevents incorrectly marking 
these reports as problems in later cleaning. 
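
When a trip covers many reports, the same correction can be applied programmatically once the trip dates and the time difference are taken from the study documentation. The Python sketch below is one possible way to do this, not the procedure used in the chapter; the function name, the argument names, and the sign convention for the shift are assumptions.

```python
from datetime import datetime, timedelta


def correct_time_zone(reports, participant_id, trip_start, trip_end, shift_hours):
    """Shift time stamps recorded during a documented trip and mark them.

    shift_hours is the signed number of hours to add to the recorded time
    stamps; its value and direction must come from the study documentation.
    """
    for report in reports:
        in_trip = (report["participant_id"] == participant_id
                   and trip_start <= report["timestamp"] <= trip_end)
        if in_trip:
            report["timestamp"] += timedelta(hours=shift_hours)  # date rolls over if needed
            report["timezonechange"] = 1
        else:
            report.setdefault("timezonechange", 0)
    return reports


# Hypothetical example: participant 449 traveled March 15-17 with a 3-hour difference.
reports = [
    {"participant_id": 449, "timestamp": datetime(2009, 3, 16, 1, 0, 0)},
    {"participant_id": 449, "timestamp": datetime(2009, 3, 20, 15, 0, 0)},
    {"participant_id": 403, "timestamp": datetime(2009, 3, 16, 15, 0, 0)},
]
correct_time_zone(reports, 449, datetime(2009, 3, 15), datetime(2009, 3, 17, 23, 59, 59), -3)
print(reports[0]["timestamp"], reports[0]["timezonechange"])
```

Because the arithmetic is done on full date-time values, a shift that crosses midnight changes the date as well as the hour.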


TIMECONFIRM VARIABLE DOES NOT MATCH TIMESTAMP 


In many cases, when the timeconfirm variable does not match the actual time, the partici- 
pant accidentally selected the wrong date or time. In other cases, some participants may 
have tried to make up for missed reports by completing many reports at once. The point 
of creating these notes is to provide helpful information when looking at other problems 
later in the cleaning. 

Rename the appropriate question as “TimeConfirm.” It is best to count early morn- 
ing reports (before 6:00 A.M.) as part of the preceding day, by adding 24 to the hour and 
subtracting 1 from the date. When changing the date, be aware that if the study continues 
across 2 months (e.g., starts in February and ends in March), one must make sure that the 
first day is switched back to the last day of the previous month. Otherwise, the date will 
read, for example, as “March 0” rather than the correct date of “February 28.” 
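
As an illustration of the early-morning adjustment, the following Python sketch (not the SPSS syntax) adds 24 to the hour and moves the date back one day. The function name and the example year are assumptions; the 6:00 A.M. cutoff comes from the text. Using calendar-aware date arithmetic avoids the “March 0” problem at month boundaries.

```python
from datetime import date, timedelta

EARLY_CUTOFF_HOUR = 6  # reports before 6:00 A.M. belong to the preceding day

def shift_early_morning(report_date, hour):
    """Return (date, hour) with early-morning reports moved to the prior day."""
    if hour < EARLY_CUTOFF_HOUR:
        return report_date - timedelta(days=1), hour + 24
    return report_date, hour

# A 1:00 A.M. report on March 1 becomes hour 25 of February 28.
print(shift_early_morning(date(2011, 3, 1), 1))
```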

The timeconfirm check depends on the window in which the participant has to com- 
plete a report; that is, the matching must take into account the whole range of acceptable 
completion times, including minutes and including going earlier in the day if acceptable. 
Table 18.3 contains sample syntax; additional lines of syntax will be needed to cover 
the other report times. If all participants started and ended the study on the same date, 
make a similar check of the dateconfirm variable and note discrepancies. Checking the 
dateconfirm variable is cumbersome if participants started on different dates, but the 
dateconfirm variable will still be valuable for helping to uncover and resolve problems 
later in the cleaning process. Be aware that if the study has an alarm at midnight, some 
participants may count the date and time differently. 


FURTHER IDIOSYNCRATIC ISSUES 


Finally, other issues that arise during a study should be documented in SPSS. Create 
a single idiosyncratic variable with a value label for each issue. For example, in one of 
our studies, a participant was hospitalized for several days. In another case, a participant 
changed the clock after Daylight Saving Time started. These reports may or may not 
need to be removed from the study, but they can at least be flagged in the dataset to screen 
before analysis if needed. 


Problems to Mark during SPSS Cleaning 


Five common problems should be identified during the cleaning. Checking for these prob- 
lems is generally straightforward, although judgment may be needed to address some 
problems. Table 18.4 contains sample syntax for all problems in this section. 


REPORT COMPLETED AT THE WRONG TIME 


This procedure checks to determine whether participants followed protocol and answered 
reports according to the schedule. Generally, we allow responses only for the hour fol- 
lowing the alarm and the 5 minutes prior to the alarm. Using the desired leniency, iden- 
tify the disallowed times, then write syntax to flag all reports completed during the disal- 
lowed times (using syntax similar to the sample in Table 18.4). Continue this command 
structure for every hour in the day, including the middle of the night, so that all reports 
are documented appropriately. 

If the alarms are not set for the top of the hour (e.g., there is an alarm at 3:30 P.M. 
and the participant has until 4:30 P.M. to answer), then the hour and minute should be 
used in an “if” command. To double-check the accuracy of the syntax, sort the reports 
by time and check to see whether everything was flagged correctly. 

This procedure does not work for randomized alarms. It may be beneficial to screen 
out reports that occur outside the specified alarm range (e.g., if there are never reports 
before 10:00 A.M. or after 10:00 P.M., then mark any reports outside this time frame). 
Beyond this suggestion, researchers may have difficulty in adapting these methods to 
check for the wrong time of a randomized schedule. 
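
For a fixed schedule, the window check described above can be sketched in Python as follows. The alarm times are hypothetical, and the upper bound is treated as exclusive (a 10:00 alarm allows 9:55 through 10:59), which is an assumption that mirrors the sample syntax flagging the whole 11:00 hour.

```python
ALARMS = [(10, 0), (13, 0), (15, 30)]  # hypothetical fixed alarm times (hour, minute)
EARLY_MIN = 5    # minutes allowed before the alarm
LATE_MIN = 60    # minutes allowed after the alarm

def wrong_time(hour, minute):
    """Return 0 if the report falls in any alarm's allowed window, else 1."""
    t = hour * 60 + minute
    for alarm_hour, alarm_minute in ALARMS:
        alarm = alarm_hour * 60 + alarm_minute
        if alarm - EARLY_MIN <= t < alarm + LATE_MIN:
            return 0
    return 1

print(wrong_time(9, 55), wrong_time(8, 58), wrong_time(16, 29))  # 0 1 0
```

Converting hour and minute to minutes since midnight handles alarms that are not at the top of the hour without separate conditions for each hour.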


REPORT COMPLETED ON THE WRONG DATE 


It is possible that reports from outside the study dates infiltrated the study, due to data 
file errors or to participant misunderstandings. Checking for this problem is simple if all 
participants started on the same date. 
TABLE 18.4. Common Problems and Syntax 

Problem: Report completed at the wrong time (within the hour); sample syntax 

    Compute wrongtimeprob=0. 
    If ((Hour < 9) OR ((Hour = 9) AND (Minute <= 54))) wrongtimeprob=1. 
    If (Hour=11) wrongtimeprob=1. 
    If ((Hour=12) AND (Minute<55)) wrongtimeprob=1. 
    Execute. 

Problem: Report completed at the wrong date (different start dates); sample syntax 

    Compute wrongdate=0. 
    Do if (PalmID=401). 
    Compute startmonth=2. 
    Compute startday=18. 
    Compute starthour=10. 
    Compute endmonth=2. 
    Compute endday=25. 
    End if. 
    Execute. 

Problem: Report completed at the wrong date (same start dates); complete syntax 

    Compute startmonth=2. 
    Compute startday=18. 
    Compute starthour=10. 
    Compute endmonth=2. 
    Compute endday=25. 
    Execute. 

Problem: Checking in both cases; complete syntax 

    If (month < startmonth) wrongdate=1. 
    If ((month=startmonth) AND (day<startday)) wrongdate=1. 
    If ((month=startmonth) AND (day=startday) AND (hour<starthour)) wrongdate=1. 
    If (month > endmonth) wrongdate=1. 
    If ((month=endmonth) AND (day>endday)) wrongdate=1. 
    Execute. 

Problem: Two reports are too close to each other; complete syntax 

    Compute day_lag=day-lag(day). 
    Execute. 
    Compute hour_min=hour*60. 
    Compute time_min=SUM(hour_min, minute). 
    Compute min_lag=time_min-lag(time_min). 
    Execute. 
    Compute tooclose=0. 
    If (day_lag=0 AND min_lag<115) tooclose=1. 
    Execute. 

Problem: Report completed too quickly (checking entire report); sample syntax 

    If (ResponseTime.1 <= 50) RTcheck1 = 1. 
    If (ResponseTime.1 > 50) RTcheck1 = 0. 
    If (ResponseTime.2 <= 50) RTcheck2 = 1. 
    If (ResponseTime.2 > 50) RTcheck2 = 0. 
    Execute. 
    Compute RTsum = SUM(RTcheck1 to RTcheck30). 
    Compute RTpercent = RTsum/30. 
    If (RTpercent >= .50) toofast = 1. 
    Recode toofast (missing = 0). 
    Execute. 

Problem: Report completed too quickly (removing single questions); sample syntax 

    If (ResponseTime.1 <= 50) timeconfirm=55555. 
    If (ResponseTime.2 <= 50) Response.2=55555. 
    Execute. 
    Missing values timeconfirm Response.2 (55555). 

Problem: Create valid report variable; complete syntax 

    Compute ValidData = 0. 
    If (wrongtimeprob = 1) ValidData = 1. 
    If (wrongdate = 1) ValidData = 1. 
    If (tooclose = 1) ValidData = 1. 
    If (toofast = 1) ValidData = 1. 
    If (practiceprob = 1) ValidData = 1. 
    Execute. 

Problem: Identical responses; sample syntax 

    Count NumOnes=Response.2 to Response.30 (1). 
    Execute. 
    Compute tooidentical=0. 
    If (NumOnes > 26) tooidentical=1. 
    Execute. 

Problem: Insufficient number of reports; sample syntax 

    Compute InsufficientReports = 0. 
    If (PalmID = 407) InsufficientReports = 1. 
    If (PalmID = 407) ValidData = 1. 
    Execute. 

Note. The syntax listed here can be used in SPSS to check for these problems. In the complete syntax, the full syntax 
listed here can be used with minor alterations (e.g., changing dates for the wrong date problem). The sample syntax 
shows all steps in the process, but additions are needed (e.g., all response time variables are needed to check the 
problem of reports completed too quickly). 


To check the problem in SPSS, use the syntax in Table 18.4. If participants had 
differing start and end dates, enter those dates separately for each participant; if all 
participants had the same start and end dates, enter one set of dates for everyone. Then, “if” 
commands can be used to flag any reports that occurred on a date outside of the 
study. To check the syntax, sort by date and time to look at these reports specifically. It is 
important to determine why there were early or late reports. Some of the early reports are 
just test reports completed by the experimenters. In other cases, the clock may have been 
altered, which means the time stamp is incorrect, or data from a different participant got 
mixed in with the current participant. Recall that if the battery in the PDA is depleted, 
then the clock and the time stamp will reset. Go through these reports to discern why the 
problem occurred, and make sure it is documented as a problem. Extra reports following 
the end of the study are more common. Some participants may keep the PDA for an extra 
day, or may complete extra reports. 
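
The date check can also be written as a comparison of full time stamps against each participant's study window. The Python sketch below is illustrative only: the participant ID and dates echo the sample syntax in Table 18.4, while the year, the dictionary layout, and the function name are assumptions.

```python
from datetime import datetime

# Hypothetical study windows per participant ID:
# (first allowed moment, last allowed moment).
STUDY_WINDOWS = {
    401: (datetime(2011, 2, 18, 10, 0), datetime(2011, 2, 25, 23, 59)),
}

def wrong_date(palm_id, stamp):
    """Return 1 if the report falls outside the participant's study window."""
    start, end = STUDY_WINDOWS[palm_id]
    return 0 if start <= stamp <= end else 1

print(wrong_date(401, datetime(2011, 2, 18, 9, 30)))  # 1
print(wrong_date(401, datetime(2011, 2, 20, 14, 0)))  # 0
print(wrong_date(401, datetime(2011, 2, 26, 8, 0)))   # 1
```

Flagged reports still need to be inspected by hand to find out why they fall outside the window.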


TWO REPORTS ARE TOO CLOSE TO EACH OTHER 


Sometimes reports have time stamps very close to each other. Sometimes, one of these 
reports is not valid, while the other is valid. For example, if the first report is a partial 
report and the second is a complete report with a time stamp a minute after the first 
report, then the participant may have been interrupted by the alarm or a PDA malfunction. 
Another example is when one report is completed too late, which makes the time 
span between the late report and the next report appear too short. In either case, one 
report is excluded from analysis but the other report is kept; otherwise, the dataset would 
contain duplicate information. 

A second reason for this problem is something much more troublesome to the data: A 
participant is attempting to complete multiple reports at one time. We have encountered 
this noncompliant behavior on some occasions, and usually the participants only change 
the dateconfirm and timeconfirm questions to appear as if they followed the fixed sched- 
ule. This retrospective reporting defeats the purpose of experience sampling methodology. 
Therefore, this issue of close reports can be a serious problem and must be examined. 

Table 18.4 contains the sample syntax to check for this problem. The syntax is some- 
what elaborate, but it avoids false positives in problem detection. First, make sure that 24 
has been added to all early morning report hours, and that 1 has been subtracted from all 
early morning dates (completed during the timeconfirm section). This allows the syntax 
to classify such reports properly. 

Sort all cases by ID, date, and time. Using the lag function, create a new variable 
that subtracts the previous report’s day variable from the present report’s day variable. If 
the resulting value is zero, then the two reports were completed on the same day. Then, 
convert the time of each report into minutes by multiplying the hour variable by 60 and 
adding the “minutes” variable. Use another lag function to subtract the previous report’s 
total minutes from the present report’s total minutes. Before continuing, determine the 
closest that two successive reports may be, in minutes, and still be legitimate. For example, 
if reports are 3 hours apart but participants are allowed to be 1 hour late or 5 minutes 
early, then the closest two legitimate successive reports can be is 115 minutes (180 - 60 - 5). 
Use this cutoff in another “if” command to detect close reports. Even if it is legitimate 
to complete back-to-back reports, it is advised to catch all reports completed within 15 
minutes of each other and be sure they are legitimate. 
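
The lag logic can be illustrated with a short Python sketch. The report layout (dictionaries with id, day, hour, and minute) is hypothetical, and the check that both reports share the same ID is an addition that the sample SPSS syntax does not include.

```python
def flag_too_close(reports, min_gap=115):
    """Sort reports and flag any that follow the previous one too closely.

    Returns the sorted reports and a parallel list of 0/1 flags. As with a
    lag-based check, the later report of a close pair is the one flagged.
    """
    ordered = sorted(
        reports, key=lambda r: (r["id"], r["day"], r["hour"], r["minute"])
    )
    flags = []
    prev = None
    for r in ordered:
        minutes = r["hour"] * 60 + r["minute"]
        close = (
            prev is not None
            and prev["id"] == r["id"]
            and prev["day"] == r["day"]
            and minutes - (prev["hour"] * 60 + prev["minute"]) < min_gap
        )
        flags.append(1 if close else 0)
        prev = r
    return ordered, flags

reports = [
    {"id": 401, "day": 18, "hour": 10, "minute": 2},
    {"id": 401, "day": 18, "hour": 10, "minute": 3},
    {"id": 401, "day": 18, "hour": 13, "minute": 5},
    {"id": 402, "day": 18, "hour": 13, "minute": 10},
]
ordered, flags = flag_too_close(reports)
print(flags)  # [0, 1, 0, 0]
```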

This syntax serves only to detect the problem, whereas judgment determines which 
reports are valid and invalid. At this point, take some time to investigate and fix all 
flagged problems. There are several reasons for close reports, and we do not want to 
throw out valid data by mistake. Keep in mind that one possible cause is different partici- 
pants who may accidentally have the same ID. By default, the syntax we provide marks 
the second report as the problem—even if the problem is actually with the first report. 
Examine the data closely and determine which reports are valid and invalid. 


REPORT WAS COMPLETED TOO QUICKLY 


This check detects whether or not participants answered questions too quickly to be 
compliant data. One of the strengths of collecting data with the PDA is that we can check 
the response time for each question. Before we can start checking for fast reports, we must 
define our cutoff for a fast report. How fast is too fast? Currently, there is no standard for 
a fast report, particularly since questions can vary in length. Past chapters on experience 
sampling methodology have recommended checking response times (Conner, Feldman 
Barrett, Bliss-Moreau, Lebo, & Kaschub, 2003; Conner, Feldman Barrett, Tugade, & 
Tennen, 2007), and have suggested removing reports faster than 10-30 milliseconds. 
This threshold may be conservative, and it may not detect all reports completed quickly. 
We investigated some of our past ESM data to determine an empirically based cutoff. 

First, we combined the response times of all questions of all participants in one of 
our studies (N = 80 participants). In this study, participants included younger adults, 
middle-aged adults, and retired adults (Study 1 in Noftle & Fleeson, 2010). We looked 
at a restricted range histogram of response times less than 150 ticks (hundredths of a 
second, or 1.5 seconds) to determine whether there was a categorical cutoff (Figure 18.1). 
Unfortunately, there was not a clear separation to distinguish valid and invalid reports. 
The histogram frequency increased between 50 and 100 ticks (0.5 seconds to 1 second), 
but there was no clear division for our cutoff. 

Second, we had a few laboratory members test their speed on the PDAs. After 
completing several practice reports to familiarize themselves with the questions, they 
completed reports as fast as they could while still responding honestly. The fastest time for 
any response was 104 ticks. This simple pilot test was mainly a benchmark to interpret 
our other findings, and some people may be able to provide some responses faster than 
100 ticks (1 second). With all this evidence, we opted to set a conservative estimate of 
50 ticks (0.5 seconds, or 500 ms) as our cutoff because we are confident that most of the 
data under 50 ticks are invalid. Ultimately, the researcher must decide how stringent to 
be when deciding the cutoff for fast response times. 

FIGURE 18.1. This histogram shows the lower range of response times for all participants for all 
questions in one ESM study; there were 20,205 responses all together. Although there is no clear 
distinction between response times that are too fast and response times that are valid, our 
recommended cutoff is 50 ticks (hundredths of a second, equivalent to 0.5 second, or 500 milliseconds), 
indicated by the black line. [Horizontal axis: response time in ticks, 0 to 150; vertical axis: frequency.] 

There are two parts to the response time cleaning. The first part is identifying and 
removing entire reports that are too fast, and the second part is identifying and removing 
individual responses that are too fast. For the first part, we decided to remove reports 
if at least 50% of the questions were answered too quickly. Table 18.4 shows a sample 
of the syntax needed to check an entire report. Two “if” commands are needed for each 
variable, creating a new variable for response times that are valid (above the cutoff) and 
invalid (below the cutoff). Make sure there are no missing values for any of these vari- 
ables. Then, compute a variable that takes the sum of all the dichotomous response time 
variables. Use this variable to compute a percentage variable, dividing the sum by the 
number of questions. If a participant answered 10 out of 30 questions too quickly, then 
the sum should be 10 (twenty “0” values and ten “1” values). When this sum is divided 
by the number of questions, the resulting value is .33, which indicates that 33% of the 
responses in that report were faster than the cutoff. 
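
The arithmetic for the whole-report check can be sketched in Python as follows. The function name is an assumption; the 50-tick cutoff and the 50% rule are the values chosen in the text.

```python
CUTOFF_TICKS = 50  # responses at or below this many ticks count as too fast

def too_fast(response_times, cutoff=CUTOFF_TICKS, max_share=0.50):
    """Return (share of fast responses, 0/1 flag for the whole report)."""
    checks = [1 if rt <= cutoff else 0 for rt in response_times]
    share = sum(checks) / len(checks)
    return share, 1 if share >= max_share else 0

# 10 of 30 questions answered too quickly: share .33, report kept.
share, flag = too_fast([40] * 10 + [200] * 20)
print(round(share, 2), flag)  # 0.33 0
```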

For the second part, we decided to remove individual questions that were completed 
too quickly. Even if the whole report is valid, individual responses should be excluded 
from analyses if they were not answered properly. Several of our participants informed 
us that they accidentally clicked a button twice in a row, which means the fast response 
is simply an error that needs to be removed. To correct this problem, we overwrite the 
response given by a participant. Before completing this step, we suggest backing up the 
current dataset just in case there is an error in the syntax. Use “if” commands to com- 
pute new values for responses that were too fast. This new value should not be a possible 
response in the data (e.g., 55555). Do this step for all response time variables, with the 
exception of the date confirm and time confirm questions. After completing this syn- 
tax, check the variables to see whether the changes were made correctly. Finally, for all 
response variables, mark the new value (e.g., 55555) as a discrete missing value, which 
can be done in the variable view in SPSS. 
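
A minimal Python sketch of the single-response step is shown below. The list layout and the `skip` parameter (positions of the date confirm and time confirm questions) are assumptions for illustration; 55555 is the example placeholder value from the text.

```python
MISSING = 55555  # placeholder that cannot occur as a real response

def blank_fast_responses(responses, response_times, cutoff=50, skip=()):
    """Return a copy of responses with too-fast answers replaced by MISSING.

    Positions listed in `skip` (e.g., confirm questions) are left untouched.
    """
    cleaned = []
    for i, (resp, rt) in enumerate(zip(responses, response_times)):
        if i not in skip and rt <= cutoff:
            cleaned.append(MISSING)
        else:
            cleaned.append(resp)
    return cleaned

responses = [1, 4, 2, 5]
times = [20, 30, 200, 50]
print(blank_fast_responses(responses, times, skip={0}))  # [1, 55555, 2, 55555]
```

Because the function returns a new list, the original responses remain available as the backup the text recommends.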


PRACTICE REPORTS DURING THE STUDY 


This procedure removes any practice reports. Occasionally, a participant chooses “prac- 
tice” for the time confirm question during the course of the study, which may be a genu- 
ine practice or a misclicked valid report. We cannot be sure which. Scroll through the 
data and investigate each report with “practice” as the timeconfirm response. At least 
three clues suggest that a participant was practicing: (1) The report has response times in 
the ambiguous range (less than 100 ticks or 1 second); (2) the report was completed too 
close to another report; or (3) all the questions have the same response, which is a general 
sign of noncompliance (Conner et al., 2003). Based on the collective evidence from these 
indicators, set practiceprob equal to 1 if it is likely to be a nonvalid report (after first set- 
ting practiceprob to 0 for all reports). 
TOO MANY IDENTICAL RESPONSES 


If almost all responses in a single report are the same number, it is very unlikely that the 
report is valid. Count the number of each response in each report, then indicate a prob- 
lem if greater than 90% of the responses are the same, as shown in the sample syntax. 
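
The same rule can be sketched in Python. Unlike the sample syntax, which counts one response value at a time, this hypothetical helper checks whichever value is most common in the report.

```python
from collections import Counter

def too_identical(responses, threshold=0.90):
    """Return 1 if more than `threshold` of the responses share one value."""
    most_common_count = Counter(responses).most_common(1)[0][1]
    return 1 if most_common_count / len(responses) > threshold else 0

# With 29 questions, 27 identical answers exceed 90%; 26 do not.
print(too_identical([1] * 27 + [3, 4]), too_identical([1] * 26 + [3, 4, 5]))  # 1 0
```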


After Identifying Problems and Notes 


After cleaning the data in SPSS, all the problems and notes indicate which reports to 
keep and which reports to remove from analyses. To do this step, make a final dichoto- 
mous variable that marks a report as valid or invalid. Then, go through the problems to 
decide the specific conditions for a valid report. If a report is marked as a problem but 
investigation reveals it to be a valid report, be sure to change the problem variable to a 
0, so that this report stays in the data. Finally, use “if” commands for each problem and 
note whether to make a report valid or invalid. This invalid variable provides an easy 
way to select the data that one wants to use in the analyses by simply using the “select if” 
command for valid reports; however, do not save the dataset after using this command, 
because the invalid data will be removed permanently. 
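
The final flag and the selection step can be sketched in Python as follows. The flag names follow Table 18.4; the dictionary layout and function name are assumptions. Filtering into a new list, rather than deleting rows, plays the role of not saving the dataset after “select if.”

```python
PROBLEM_FLAGS = ("wrongtimeprob", "wrongdate", "tooclose", "toofast", "practiceprob")

def invalid_flag(report):
    """Return 1 if any problem variable is set for this report, else 0."""
    return 1 if any(report.get(name, 0) == 1 for name in PROBLEM_FLAGS) else 0

reports = [
    {"id": 401, "wrongtimeprob": 0, "tooclose": 0},
    {"id": 401, "wrongtimeprob": 1, "tooclose": 0},
    {"id": 402, "toofast": 1},
]
for r in reports:
    r["invalid"] = invalid_flag(r)

# Filter into a new list so the full dataset, including invalid reports, is kept.
valid_reports = [r for r in reports if r["invalid"] == 0]
print(len(reports), len(valid_reports))  # 3 1
```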


PARTICIPANT HAS INSUFFICIENT OR TOO MANY REPORTS 


One of the advantages of using experience sampling methods is to collect many reports 
over a given period. If there are only a few reports, then there is not enough information 
available to run analyses. Therefore, the researcher should determine how many valid 
reports each participant has. Under the Crosstabs menu, select the invalid variable (col- 
umn) and the participant ID variable (row). This output should provide the frequency of 
participants’ valid reports, invalid reports, and total reports. This table can be copied 
into Microsoft Excel (or into another SPSS datasheet, if desired). Determine the number 
of possible reports a participant could have completed during the study. Then, divide the 
number of valid reports by the total possible reports to determine the percentage of valid 
reports for each participant. The researcher should decide what the appropriate cutoff 
should be. One cutoff might be that any participant whose valid reports amount to less than 
20% of possible reports is removed from analyses. If any participants are removed from the study, go back to the 
SPSS file and mark these reports with sample syntax in Table 18.4. Be sure to make all the 
participants’ reports invalid, so they will not be included in the analyses. 
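
The percentage calculation can be done outside SPSS as well. The Python sketch below uses made-up counts and a hypothetical function name; the 20% cutoff is the example value from the text.

```python
def insufficient_reports(valid_counts, possible_reports, cutoff=0.20):
    """Map each participant ID to 1 if their share of valid reports is below cutoff."""
    flags = {}
    for pid, n_valid in valid_counts.items():
        share = n_valid / possible_reports
        flags[pid] = 1 if share < cutoff else 0
    return flags

# Hypothetical counts of valid reports out of 56 possible reports.
valid_counts = {401: 40, 407: 9}
print(insufficient_reports(valid_counts, possible_reports=56))  # {401: 0, 407: 1}
```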

Too many reports may occur if two participants had the same ID or the Palm ID 
was not changed to the correct unique ID. Investigate any cases in which the number of 
reports is too large, and resolve and document the solution. 


ORGANIZING THE DATA FILE 


In data view, type in meaningful names for each of the response variables. Save the 
syntax, the data, and a backup of the data. Then, delete all the following variables: 
question.# variables, responsetime.# variables, rt#check variables, num#s variables, and 
variables used to create lags. Be sure that the numbers listed in the response variables 
correspond exactly to the numbers participants clicked on the PDA screens. Save again, 
and one is ready to begin. 
Discussion 


Data Analysis 


After conducting all of these cleaning steps, the dataset is ready for running analyses. 
This cleaning process is time-intensive, but it is worth the time invested. Be sure to make 
a backup file of the dataset before starting the analyses, in case the dataset is saved 
after using the “Select If” command. Several ESM experts have discussed data-analytic 
approaches with ESM data (Bolger, Davis, & Rafaeli, 2003; Nezlek, 2008; Reis & Gable, 
2000). There are also a couple of useful guides that have a step-by-step explanation of 
how to run multilevel modeling analyses within SPSS (Fleeson, 2007; Nezlek, Chapter 
20, this volume). The data may take some time to collect and clean, but there are many 
possible analyses available once the dataset is complete. 


Conclusion 


The procedures outlined in this chapter serve to help researchers to maximize the expe- 
rience sampling data collected by PDAs. We may have come a long way since the early 
paper-and-pencil methods of ESM, but researchers must adapt their data cleaning stan- 
dards to this new technology. Even if researchers do not use the exact methods and cut- 
offs recommended in this chapter, we hope to have elucidated several important issues 
that may arise in the data. Every method has strengths and limitations, and the unique 
issues with experience sampling methods should be considered before running analyses. 

We also hope that this chapter serves as a guide for people who wish to use this 
technology. This chapter does not explain the entire process of running a PDA study, but 
it supplements past articles that have aided researchers (e.g., Conner et al., 2003). This 
cleaning process is perhaps more difficult and more arduous than is apparent at first, but 
it is necessary for applying rigorous standards to PDA studies. We hope that this guide 
reduces the complexity of the cleaning process by helping researchers to negotiate a PDA 
study from the point of collecting data to beginning substantive analyses. 
Note 


1. Lacroix and Giguere (2006) provide a guide about restructuring data in SPSS for additional reading 
on this process. 
CHAPTER 19 


Techniques for Analyzing 
Intensive Longitudinal Data 
with Missing Values 


ANNE C. BLACK 
OFER HAREL 
GREGORY MATTHEWS 


Longitudinal studies are unique in their requirement of respondents to provide data 
repeatedly over time. Such studies may require multiple responses in a day, as may 
occur with experience sampling (or ecological momentary sampling), or may require 
participants to respond less frequently over a longer period of time. Particular threats 
to repeated measures studies, and specifically those involving momentary assessment, 
are fatigue, forgetfulness, noncompliance, and dropout. The result is unplanned missing 
data. Reviews of assessment completion in intensive longitudinal studies have reported 
noncompliance rates close to 90% of observations (e.g., Stone, Shiffman, Schwartz, Brod- 
erick, & Hufford, 2003), and increasing loss of data over the course of studies (e.g., Brod- 
erick, Schwartz, Shiffman, Hufford, & Stone, 2003). The use of diary data has become 
a popular strategy to capture in situ information, and has advantages over retrospective 
calendar methods. However, high rates of missing data can be particularly problematic. 
Although recommended guidelines to minimize missing observations in over-time studies 
are widely available (e.g., Boys et al., 2003; Chang, Yang, Tang, & Ganguli, 2009; Huf- 
ford & Shiffman, 2003; McKnight, McKnight, Sidani, & Figueredo, 2007; Ribisl et al., 
1996; Wisniewski, Leon, Otto, & Trivedi, 2006), some degree of nonresponse is all but 
inevitable. In addition to participant-based causes, data may be missing due to equipment 
failure, data entry error, lost data, and chance. 

In this chapter, we discuss the problems and causes of missing data in studies of daily 
life that involve repeated measures, recommend techniques to prevent or minimize the 
occurrence of missing data, outline how analysts can determine what assumptions are 
most appropriate for their data, and review typical and best practices for handling incom- 
plete intensive longitudinal data. We present a motivating example, using a subset of data 
collected by Conner (2009), in which university students were asked to provide daily 
diary data about their alcohol use and well-being over a 21-day period. For our example 
model, positive mood (Mood) for 263 students was measured daily for 21 days, and was 
predicted by day in the study, headache, stress, weekend day, and number of drinks the 
previous night. In the example, we impose missing values on two variables and illustrate 
model parameter estimation with various missing data (MD) techniques, demonstrating 
the impact of MD and the utility of informed methods for deriving unbiased and efficient 
estimates. 


The Linear Mixed Model 


Throughout the chapter, we discuss the analysis of incomplete intensive longitudinal data 
with the linear mixed model (also called hierarchical linear model, multilevel model, and 
random effects regression model). To review, the general two-level model is represented 
as follows: 


Level-1 model: $Y_{ti} = \pi_{0i} + \sum_{p=1}^{P} \pi_{pi} X_{pti} + e_{ti}$ 

Level-2 model: $\pi_{pi} = \beta_{p0} + \sum_{q=1}^{Q} \beta_{pq} W_{qi} + r_{pi}$ 


where $Y_{ti}$ is the outcome for person i at time t, $\pi_{0i}$ is the level-1 intercept estimate for 
person i, $X_{pti}$ is the pth level-1 covariate with value specific to person i at time t, $\pi_{pi}$ is the 
effect of the pth covariate for person i, and $e_{ti}$ is the error for person i at time t (the 
variance is $\sigma^2$). At level 2, $\pi_{pi}$ is modeled as a function of $\beta_{p0}$, the level-2 intercept; $W_{qi}$, the 
collection of Q level-2 covariates with values specific to person i; $\beta_{pq}$, the corresponding 
effects of $W_{qi}$; and $r_{pi}$, the level-2 error term for person i (variance is $\tau_{pp}$). In this chapter 
we refer to variables measured at each occasion as “level-1,” and time-invariant, 
person-level variables as “level-2” covariates. 


The Problem of Missing Data 


Missing data are not inherently problematic. Threats to the validity of statistical infer- 
ences arise when MD are handled inappropriately, or when rates of missing information 
are very high. Even very small proportions of incomplete cases can lead to substantial 
missing information. To estimate the proportion of total cases available for use with 
complete case analysis, let a be the rate of random missing values in each variable and 
let b be the number of variables. The proportion of complete cases is equal to $(1 - a)^b$. 
For example, given a dataset with 20 variables and 10% random missing values in each 
variable, we should expect $(1 - 0.1)^{20} \approx 0.12$, or approximately only 12% of cases with 
complete data. The result of poorly informed MD handling is often loss of statistical 
power, and biased and inefficient parameter estimates that may lead to incorrect conclu- 
sions about the nature of variable relationships in the population. 
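
The complete-case calculation above is easy to verify numerically. This short Python sketch assumes, as the formula does, that values are missing at random and independently across variables; the function name is hypothetical.

```python
def complete_case_share(a, b):
    """Expected proportion of complete cases with missing rate a on each of b variables."""
    return (1 - a) ** b

# 20 variables, each with 10% random missing values.
print(round(complete_case_share(0.10, 20), 2))  # 0.12
```

Trying other values shows how quickly complete cases disappear as the number of variables grows, even when each variable is only slightly incomplete.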

Analyses of determinants of MD in social science studies have revealed that incom- 
plete study participation and survey item nonresponse may be predictable from impor- 
tant variables such as gender, age, race, and education level (see, e.g., Cranford et al., 
2008; Kupek, 1998; Suchman & McCandless, 1940). Under certain conditions, some 
subgroups may have a higher probability of nonresponse than others. To the extent that 
these subgroups differ from groups with complete data on variables of interest, results 
from analyses of complete case data will be biased in the direction of the responding 
groups’ scores, and will lack generalizability to nonresponders. Despite this fact, one 
of the most commonly used MD techniques is case deletion, or the removal of cases 
from the analysis that exceed an analyst’s predetermined acceptable rate of missingness 
(e.g., Johnson et al., 2009). This procedure makes use of complete (or less incomplete) 
case data to the exclusion of cases with (more) incomplete data. When study-compliant 
participants differ substantively from noncompliant participants, statistical conclusions 
drawn from the selected data will be misleading. 

It is important to understand the nature of missing values in any study in order to 
select a MD technique that is suited to the condition. Selection of MD techniques should 
be done with the primary goal of preserving the distributional characteristics of the vari- 
ables of interest, as well as their interrelationships with other variables (Schafer, 1997; 
Schafer & Graham, 2002) for the purpose of deriving valid and meaningful inferences 
from the available data. Failure to preserve these variable characteristics will result in 
incorrect statistical conclusions. 


Preventing MD 


The occurrence of MD is well documented. There is no dispute about its ubiquity, par- 
ticularly in studies involving repeated measures. In response, several publications have 
outlined methods to prevent or reduce nonresponse and attrition (e.g., Broderick et al., 
2003; Hufford & Shiffman, 2003; McKnight et al., 2007; Ribisl et al., 1996; Stone et 
al., 2003; Wisniewski et al., 2006). Because prevention of MD is the best way to ensure 
quality parameter estimates, we outline these recommendations first. Guidelines can be 
grouped into the following general themes: 


1. Outline clear instructions for schedules and completion of assessments. Methods 
for completing instruments and scoring items should be well defined, with specified time 
periods within which assessments should take place. 


2. Train study personnel in data management procedures and review data collec- 
tion regularly. The importance of study retention and complete data should be stressed 
with study staff. Data should be entered regularly for early identification of problems or 
errors. Data quality-control procedures should be conducted routinely, checking data 
entry for accuracy and completeness. 


3. Conduct a pilot study. A pilot study will provide early information about feasibility 
of participant recruitment and the functioning of data collection instruments. Problem- 
atic items, burdensome assessments or schedules, and issues affecting retention can be 
identified and addressed before beginning the full study. 


4. Provide prompts and feedback. Studies showed improved compliance with data 
collection schedules when prompts to complete assessments and feedback about compli- 
ance rates were provided. Use of electronic diary data with “enhanced compliance” fea- 
tures including real-time reminders has been associated with high response rates. 


5. Maintain regular contact with study participants/identify location information. 
Guidelines suggest gathering rich contact information early in the study, and reminding 
participants of scheduled assessments (see Conner & Lehman, Chapter 5, this volume). 
Difficulties in tracking participants should be reviewed routinely. Procedures should be 
outlined to follow up with participants who do not complete scheduled assessments. 


6. Design assessment equipment with “livability functions.” Because assessment 
completion is built into participants’ daily routines, options to turn off audible reminders 
for sleep and quiet activities should be provided. 


7. Ensure the proper use and functioning of equipment. Electronic diaries and 
global positioning system (GPS) devices may be used to record observations. It is impor- 
tant to ensure that participants are properly trained in the use of such equipment, and 
that equipment is well maintained to minimize failure. User interfaces should be intuitive 
and easy to use. 


8. Minimize response burden. Limiting measurement occasions and number of vari- 
ables reduces the demand on responders and the probability of missing data. Assessments 
should be no more complicated than necessary to obtain the information needed. 


Anticipating the occurrence of MD, and planning and implementing procedures 
to minimize nonresponse and attrition are critical to the conduct of longitudinal stud- 
ies. However, to the extent that these strategies do not eliminate MD completely, the 
researcher should be well informed about how to proceed in the face of incomplete obser- 
vations. 


Current Trends in Longitudinal Studies with Missing Data 


A recent review of MD practices in studies of behavioral development, published over a 
6-year period (2000-2006; Jeličić, Phelps, & Lerner, 2009), examined the reporting of 
MD and techniques applied. The authors noted evidence of MD in over half of the stud- 
ies, and the vast majority of those studies used case deletion techniques. A very small 
percentage of studies applied more principled techniques of maximum likelihood esti- 
mation and multiple imputation, and the authors concluded that the use of these better- 
performing techniques has not significantly increased in the past decade. Our own non- 
systematic review of electronically available articles describing studies of daily life or use 
of experience sampling methods revealed a similar trend, with a majority of those studies 
reporting some use of case deletion. 


MD Mechanisms 


The processes by which MD occur have been categorized with respect to the relationship 
between the probability of missingness and variables in the dataset. These processes, or 
mechanisms, have important implications for the analytical techniques that provide valid 
statistical inferences. Three mechanisms have been described in the MD literature (Little 
& Rubin, 2002; Rubin, 1976): missing completely at random (MCAR), missing at ran- 
dom (MAR), and missing not at random (MNAR). 


MCAR 


Missing observations may result from a completely random process whereby missingness 
is unrelated to values in the dataset, either missing or observed (Little & Rubin, 2002). In 
this condition, called MCAR, the observed data are a random subset of the hypothetical 
(but unobserved) complete dataset, and are therefore representative of the hypothetically 
complete set. Consider the Mood study example described earlier. If students occasion- 
ally forgot to complete diary entries, and forgetting was unrelated to day of the week, 
number of drinks consumed, mood, or any other relevant variable, the nonresponse 
would be MCAR. MCAR is the most benign condition, in that the means and variances 
of the would-be complete data are preserved. Analyses of the available data will provide 
accurate and unbiased, but less efficient, estimates. 


MAR 


Data are MAR if missingness on a variable can be accounted for, or explained, by other 
variables included in the dataset. These nonresponse-relevant variables may be covari- 
ates of the variable with missing data, or earlier observed measures of the variable itself. 
Referring to the Mood study example, if participants were more likely to skip diary 
entries on weekends (e.g., because of less structured routine, or less convenient access 
to Internet to log entries), missingness would be accounted for by the observed variable 
“weekend.” Because missingness can be fully explained by the weekend indicator vari- 
able, the nonresponse mechanism is MAR. However, as we discuss later, this condition 
holds only if the variable(s) accounting for missingness is observed and included in the
estimation procedure. If missingness-relevant variables are unobserved, or are omitted from 
model estimation, missingness is then not at random (MNAR). 

The MAR assumption cannot be verified with unplanned missingness without 
follow-up with nonresponders, but in many cases it is a reasonable assumption (Schafer 
& Graham, 2002) and there are methods to make the assumption more plausible, as we 
discuss later. 


MNAR 


Values are MNAR when the missingness is a function of the unobserved values them- 
selves, even after controlling for other observed variables. In other words, values are 
unobserved because of what those values would have been, had they been observed. In 
the Mood study example, if the “NDrinks” variable were left blank more often by
participants who drank heavily, even after accounting for weekend day and other covariates, 
then unobserved NDrinks values would be MNAR. Clearly, statistical inferences derived 
from the available data would be unrepresentative of nonresponders. 
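
The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the sample size, the data-generating rule for NDrinks, and every coefficient in the missingness rules are assumptions chosen for the demonstration, not values from the Mood study. It deletes values under an MCAR, a MAR, and an MNAR rule and compares the mean of the remaining values with the mean of the full data.

```python
import math
import random

random.seed(1)

def inv_logit(x):
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical diary records: a Weekend indicator and a number of drinks.
records = []
for _ in range(20000):
    weekend = 1 if random.random() < 2 / 7 else 0
    # Assumed data-generating rule: more drinks on weekends.
    ndrinks = max(0.0, random.gauss(1.0 + 2.0 * weekend, 1.0))
    records.append((weekend, ndrinks))

def observed_mean(records, p_missing):
    """Mean of NDrinks among records that survive the missingness rule."""
    kept = [nd for (wk, nd) in records if random.random() >= p_missing(wk, nd)]
    return sum(kept) / len(kept)

true_mean = sum(nd for _, nd in records) / len(records)

# MCAR: constant probability, unrelated to any variable.
mcar = observed_mean(records, lambda wk, nd: 0.3)
# MAR: probability depends only on the always-observed Weekend indicator.
mar = observed_mean(records, lambda wk, nd: inv_logit(-1.5 + 1.5 * wk))
# MNAR: probability depends on the NDrinks value that would go missing.
mnar = observed_mean(records, lambda wk, nd: inv_logit(-2.0 + 0.8 * nd))

print(f"true {true_mean:.2f}  MCAR {mcar:.2f}  MAR {mar:.2f}  MNAR {mnar:.2f}")
```

In this setup the MCAR mean stays close to the full-data mean, whereas the unadjusted MAR and MNAR means are both too low. The difference between the two is that the MAR bias can be removed by conditioning on Weekend, while the MNAR bias cannot be removed using observed variables alone.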


Ignorable Missingness 


Rubin (1976; see also Little & Rubin, 2002) coined the terms “nonignorable” and “ignor- 
able” to refer to whether the process accounting for missing values must be modeled 
within the substantive analysis. This is not the same as suggesting that, in some cases, 
MD can be disregarded; the analyst should never ignore the fact that data are missing. 
The difference between ignorable and nonignorable missingness has to do with whether 
model estimates will be proper if only the data model (the model of substantive inter- 
est) is specified, or whether an additional model that explains the nonresponse must be 
included. 

Missingness is ignorable when (1) data are MAR and (2) the parameters of the data 
model are independent of the parameters of the missingness mechanism (Schafer, 1997). 
In other words, the model defining the process of missingness has no bearing on the data 
model parameter estimates—they are completely independent processes. 

If the analyst uses a likelihood-based or Bayesian estimation technique to model 
repeated measures, MAR and MCAR can reasonably be treated as ignorable conditions 
(Schafer & Graham, 2002); no additional information in the model is required about 
nonresponse (Rubin, 1976). 

Conversely, if the analyst uses a semiparametric technique for nonlinear modeling 
(e.g., when outcomes are dichotomous or ordinal), such as generalized estimating equa- 
tions (GEE; Liang & Zeger, 1986), only the MCAR condition is ignorable. Because GEE 
relaxes model assumptions about the random component of the mixed model to accom- 
modate non-normally distributed errors, stricter assumptions are required about MD— 
that is, missingness must be a completely random process (MCAR). 

Always, if data are MNAR, the condition is nonignorable. Nonignorability requires 
complex modeling in that both the process of substantive interest and the process account- 
ing for nonresponse must be modeled. In this chapter, we focus on options for conditions 
of ignorable missingness, and offer a brief overview of options for modeling incomplete 
data with nonignorable missingness. Readers are referred to additional resources on the 
topic for more detailed information. 


Testing Assumptions about Nonresponse 


Whereas it is not possible to affirm statistically that data are MAR or MNAR because 
the unobserved values are not available for such testing, the analyst can test the assump- 
tion of MCAR. 

To determine whether nonresponse on a given variable is related to other observed 
variables, or whether nonresponse is a completely random process, one can compare 
the observed variable means for responders and nonresponders. This is done by cre- 
ating a missingness indicator for the variable with MD; missing values may be coded 
0 and observed values, 1. T-tests are conducted, comparing the means of participants 
with incomplete data to those with complete data on other variables in the dataset. For 
example, participants with and without missing values on NDrinks could be compared 
on ratings of Headache, Stress, and Mood. Significant group mean differences on any 
of those variables would indicate values on NDrinks are not MCAR; there are system- 
atic differences between participants with and without missing data. For multiple t-tests, 
Little (1988) proposed an omnibus chi-square test for MCAR to control for inflated Type 
I error rates. The chi-square test is available in the Statistical Package for the Social Sci- 
ences (SPSS) Missing Values Add-On package (SPSS, Inc., 2010). If the probability of 
observed differences is p < .05, the pattern of nonresponse departs significantly from the 
MCAR assumption, and the analyst should proceed with the assumption that values are 
either MAR or MNAR. 
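
The indicator-based comparison can be sketched in a few lines. The data, the rule that makes NDrinks more likely to be skipped at higher Stress, and the helper name `welch_t` are all assumptions made for illustration. The sketch uses Welch's unequal-variance t statistic, a common variant of the t-test rather than necessarily the version intended in the text, and it does not implement Little's omnibus chi-square test.

```python
import math
import random
import statistics

random.seed(2)

# Hypothetical records: (Stress rating, NDrinks or None when missing).
rows = []
for _ in range(2000):
    stress = random.choice([0, 1, 2, 3, 4])
    ndrinks = random.choice([0, 1, 2, 3])
    # Assumed rule for the illustration: higher stress, more skipped items.
    if random.random() < 0.10 + 0.10 * stress:
        ndrinks = None
    rows.append((stress, ndrinks))

# Missingness indicator: 0 = missing, 1 = observed (the coding used in the text).
observed_flag = [0 if nd is None else 1 for _, nd in rows]

stress_obs = [s for (s, _), r in zip(rows, observed_flag) if r == 1]
stress_mis = [s for (s, _), r in zip(rows, observed_flag) if r == 0]

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

t_stat, df = welch_t(stress_mis, stress_obs)
print(f"t = {t_stat:.2f}, df = {df:.0f}")
```

A large absolute t value on Stress indicates that NDrinks nonresponse is related to an observed variable, so the MCAR assumption is not tenable for these simulated data. Repeating the comparison across many variables inflates the Type I error rate, which is the motivation for an omnibus test.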


Increasing the Plausibility of MAR 


Because modeling incomplete data is much more straightforward when data are MAR 
than when MNAR, measures should be taken to increase the plausibility of the MAR 
assumption when it is not a planned condition. Including in the model (the imputation or 
analytical model) variables that account for missingness or are strongly correlated with 
the incomplete variable, whether or not they are of substantive interest, accomplishes this 
goal (Collins, Schafer, & Kam, 2001; Little & Rubin, 2002). Inclusion of nonresponse- 
relevant auxiliary variables increases the likelihood that covariates of missingness are 
controlled for (such that any remaining variance in missingness is nonsystematic), and 
reduces the probability of bias in parameter estimation. Using the Mood example, if 
number of classes, having a job, or even responses to the question “How likely are you to 
forget some diary entries?” were related to the probability of missingness on one or more 
variables, including these in the model would improve parameter estimation. If prob- 
able causes of missingness can be identified in advance, and measured, then this creates 
an analytical advantage. The effort invested in the planning and measurement of these 
variables is often well rewarded. In later sections, we outline how these variables can be 
incorporated into models to improve estimation. 
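
A simplified demonstration of why nonresponse-relevant variables matter is given below. It is a sketch under stated assumptions: the data are simulated, Mood is assumed to be higher on weekends, and missingness depends only on the Weekend indicator. Stratifying on Weekend stands in for "including the variable in the model"; it is not the imputation or ML procedure described in this chapter.

```python
import random

random.seed(3)

# Hypothetical diary days: Weekend is always observed; Mood may be missing.
days = []
for _ in range(20000):
    weekend = 1 if random.random() < 2 / 7 else 0
    mood = random.gauss(3.0 + 1.0 * weekend, 1.0)  # assumed: better mood on weekends
    p_missing = 0.5 if weekend else 0.1            # MAR: depends on Weekend only
    seen_mood = None if random.random() < p_missing else mood
    days.append((weekend, seen_mood, mood))

true_mean = sum(full for _, _, full in days) / len(days)

# Estimate 1: ignore Weekend and average the observed Mood values.
observed = [m for _, m, _ in days if m is not None]
naive_mean = sum(observed) / len(observed)

# Estimate 2: condition on Weekend. Average Mood within each Weekend stratum,
# then weight each stratum by its share of ALL days (observed or not).
adjusted_mean = 0.0
for level in (0, 1):
    stratum = [m for wk, m, _ in days if wk == level]
    seen = [m for m in stratum if m is not None]
    adjusted_mean += (len(stratum) / len(days)) * (sum(seen) / len(seen))

print(f"true {true_mean:.3f}  naive {naive_mean:.3f}  adjusted {adjusted_mean:.3f}")
```

The naive mean underrepresents weekend days and is biased downward, whereas the estimate that conditions on Weekend recovers the full-data mean. If Weekend had not been recorded, no such correction would be available.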


Ad Hoc Approaches to MD 


MD techniques have been classified as ad hoc or principled. Ad hoc approaches such 
as complete case analysis and single imputation are popular methods because of their 
simplicity. However, the techniques are not based on statistical theory, and their use can 
result in misleading inferences. 


Case Deletion 


Case deletion, also called complete case data analysis, involves removing from the analy- 
sis all cases with any MD. While the technique is straightforward and simple to apply, 
allowing the researcher to proceed with complete-data statistical routines, unless the 
complete cases are a random subset of all cases in the sample (that is, unless the data are 
MCAR), this approach may yield results that are biased (Little & Rubin, 1989, 2002). In 
addition, the technique may result in substantial reduction in sample size (see “The Prob- 
lem of Missing Data” section) and corresponding loss of power and estimation precision 
(Little & Rubin, 1989, 2002). Reduced sample size may impose limitations on the types 
of analyses that can be conducted, and may preclude the use of large-sample techniques. 
With a very small proportion of MD, complete case analysis may be justified (Graham, 
Cumsille, & Elek-Fisk, 2003; Little & Rubin, 2002), but even in this case, the analyst 
must consider to what degree the fully observed and partially observed case subsets differ 
on parameters of interest (Little & Rubin, 2002). 
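
The loss of cases under complete case analysis can be approximated with simple arithmetic. The sketch below assumes that every item is missing independently with the same probability, an idealized MCAR-style assumption that real diary data are unlikely to satisfy; the function name and the 5% rate are hypothetical.

```python
def complete_case_fraction(p_missing, n_items):
    """Expected share of cases with no missing item, assuming each item is
    missing independently with the same probability."""
    return (1.0 - p_missing) ** n_items

# Share of fully observed cases when 5% of entries are missing,
# for increasing numbers of items per case.
for n_items in (5, 21, 126):
    print(n_items, round(complete_case_fraction(0.05, n_items), 3))
```

Even a small per-item missingness rate leaves few complete cases once each participant contributes many measurements, which is the typical situation in intensive longitudinal designs.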


A variation of casewise deletion involves removing cases with considerable MD, while 
retaining cases with no, and lower rates of, MD (e.g., Johnson et al., 2009). Although this 
method allows retention of more cases for analysis, it, too, introduces risk of bias when 
data are not MCAR, and reduces power, as illustrated in our example. 


Single Imputation 


Single imputation involves the replacement of missing values with estimated values, 
according to some algorithm. Only one plausible value is imputed per missing value. 
Methods for deriving imputed values are many, including mean, regression, and hot deck 
imputation, and last observation carried forward (LOCF). In (unconditional) mean impu- 
tation, the mean of available cases replaces all missing values on a given variable. Regres- 
sion imputation involves regressing the partially observed variable on other variables in 
the dataset, and replacing missing values with predicted values. With hot deck imputa- 
tion, the observed value for a similar case is imputed. The LOCF technique imputes the 
last observed value in a given variable for all subsequent missing values. 

These techniques are easy to apply and allow the analyst to analyze the completed 
data set, unrestricted. However, single imputation is problematic in that failure to account 
for uncertainty about the true value, and treating the imputed value as if it were observed, 
may result in overestimated precision. The results may be inflated Type I error rates and 
misguided statistical inferences (Little & Rubin, 1989). 
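
Two of these algorithms, unconditional mean imputation and LOCF, are simple enough to sketch directly. The function names and the short series of ratings are hypothetical. The final lines show the overstated precision described above: filling gaps with the mean shrinks the standard deviation of the variable.

```python
import statistics

def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

def locf(values):
    """Carry the last observed value forward; leading gaps stay None."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    return out

mood = [3.0, None, 4.0, 5.0, None, None, 2.0]  # hypothetical daily ratings

mean_filled = mean_impute(mood)
locf_filled = locf(mood)

# Compare variability before and after mean imputation.
observed = [v for v in mood if v is not None]
sd_observed = statistics.stdev(observed)
sd_mean_filled = statistics.stdev(mean_filled)
print(mean_filled, locf_filled, sd_observed, sd_mean_filled)
```

Because the filled-in values are treated as real observations, the sample size appears larger and the variance smaller than the data support, so standard errors computed from the completed series are too small.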


Principled MD Methods Assuming Ignorable MD 


Statistical and empirical evidence has established that principled MD techniques, includ- 
ing maximum likelihood (ML) estimation algorithms (Dempster, Laird, & Rubin, 1977) 
and multiple imputation (MI) (Rubin, 1987), provide more accurate and efficient estimates 
than complete case analysis and single imputation (see, e.g., Little & Rubin, 1989, 2002; 
Schafer, 1997; Schafer & Graham, 2002). Additionally, these principled techniques may 
be applied under less restrictive MD assumptions. Principled methods include likelihood- 
based and Bayesian estimation techniques, and MI. 


Likelihood-Based Methods 


The default estimation procedure for longitudinal data models in many commonly used 
statistical software packages is maximum likelihood (ML). With these estimation algo- 
rithms, the parameters that have the greatest likelihood of producing the observed data, 
given the specified model, are identified. Maximum likelihood estimators are the model 
parameter estimates that maximize the likelihood of the data (for details about MLEs, 
see Dempster et al., 1977; Hox, 2002; Little & Rubin, 1989, 2002). 

MLE does not require observations to be balanced; individuals may have differing 
numbers of observations spaced at different intervals, which makes MLE well suited for 
intensive longitudinal designs. All complete and partially observed cases contribute to 
the MLE of model parameters, and the MD values are treated as random variables to 
be averaged over (Collins et al., 2001). Given a properly specified model, ML parameter 
estimates from incomplete longitudinal data will be unbiased and efficient when missingness
is ignorable (Little & Rubin, 1989, 2002). It is important to note that ML is a 
large-sample technique. As such, estimates may not be reliable when ML is used with 
small or moderate-size samples, and it may be necessary to consider other methods in 
this case. 


Including Auxiliary Variables in the MLE Model 


Auxiliary variables are variables that can be included in the model to improve estimation 
in the presence of MD, although they are not typically of substantive interest (Collins et 
al., 2001). Such variables are useful for model estimation because they are related either 
to missingness or to the variables with MD. Inclusion of variables that account for miss- 
ingness increases the plausibility of the MAR assumption (recall that omitting variables 
from the estimation process that are relevant to nonresponse creates an MNAR condi- 
tion). Including such auxiliary variables in the estimation process reduces the possibility 
of bias due to MNAR. Graham (2003) provides detailed instruction for including auxil- 
iary variables in structural equation models and illustrates their utility in improving bias 
and efficiency of estimates. Such variables may be included as extra dependent variables, 
or as correlates of other variables in the model, with similar results under defined condi- 
tions (Graham, 2003). The technique is also outlined for linear growth models in Muthén 
and Muthén (1998-2010). 


Software Programs Using ML Estimation 


Several software programs are available that use ML estimation for longitudinal data 
modeling with missing values. Programs for mixed-effects models include HLM (Scientific 
Software International, Inc., 2005-2010), Statistical Analysis Software (SAS Institute, 
Inc., 2010), Stata (StataCorp LP, 1995-2010), Mplus (Muthén & Muthén, 1998-2010), 
S-Plus (TIBCO Software, Inc., 2000-2010), and R (R Development Core Team, 2010). 
Software packages are reviewed by Collins and colleagues (2001), Goldstein (2003), and 
Roberts and McLeod (2008). 


Incomplete Covariates 


An important caveat is that most software programs including HLM and SAS cannot 
handle missing values on model covariates, and delete incomplete cases with missing 
level-2 observations by default. In Mplus, missingness on covariates can be handled if 
distributional assumptions are specified in the model (Muthén & Muthén, 1998-2010). 


Bayesian Estimation 


Bayesian estimation techniques are similar to MLE in how missing values are handled. 
With Bayesian estimation, a prior distribution is specified that provides information 
about the population parameters before they are estimated from the data. The a priori 
information, combined with the likelihood distribution derived from MLE, produces a 
posterior distribution of parameter estimates from which point estimates are drawn (for 
a detailed review, see Gelman, Carlin, Stern, & Rubin, 2003; Little & Rubin, 2002; 
Raudenbush & Bryk, 2002). 


Multiple Imputation 


With MI (Rubin, 1987) missing observations are replaced with m > 1 estimated plausible 
values to create multiple, alternative completed datasets (see also Little & Rubin, 2002; 
Schafer & Graham, 2002). The technique proceeds in three distinct phases: (1) Values 
are imputed, (2) completed data sets are analyzed individually, and (3) multiple parameter 
estimates are combined. MI provides the advantage of allowing complete data analytical 
routines while accounting for uncertainty of estimates due to imputation. 
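
The three phases can be illustrated with a deliberately simple case: a single normally distributed variable with values missing completely at random. This is a sketch under strong assumptions, not the mixed-model imputation recommended for longitudinal data. The data are simulated, a noninformative prior is assumed, and m = 20 is an arbitrary choice.

```python
import math
import random
import statistics

random.seed(4)

# Hypothetical Mood ratings with missing entries (None).
mood = [random.gauss(3.0, 1.0) if random.random() > 0.3 else None
        for _ in range(200)]
observed = [v for v in mood if v is not None]
n = len(observed)
y_bar, s2 = statistics.mean(observed), statistics.variance(observed)

m = 20
estimates, variances = [], []
for _ in range(m):
    # Phase 1: impute. Draw the parameters from their posterior under a
    # noninformative prior, then draw missing values given those parameters.
    chi2 = random.gammavariate((n - 1) / 2, 2.0)  # chi-square draw, n - 1 df
    sigma2 = (n - 1) * s2 / chi2
    mu = random.gauss(y_bar, math.sqrt(sigma2 / n))
    completed = [random.gauss(mu, math.sqrt(sigma2)) if v is None else v
                 for v in mood]
    # Phase 2: analyze each completed dataset with a complete-data routine.
    estimates.append(statistics.mean(completed))
    variances.append(statistics.variance(completed) / len(completed))

# Phase 3: combine the m estimates and their variances.
q_bar = sum(estimates) / m
u_bar = sum(variances) / m
b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
pooled_se = math.sqrt(u_bar + (1 + 1 / m) * b)
print(f"pooled mean {q_bar:.3f}, pooled SE {pooled_se:.3f}")
```

Drawing the parameters anew for each imputation is what distinguishes this from repeating a single imputation m times: the spread of the m estimates carries the uncertainty about the missing values into the pooled standard error.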

In many cases, a small number of imputations is adequate for efficient parameter esti- 
mation (Schafer, 1997), but many more may be needed to improve efficiency (Graham, 
Olchowski, & Gilreath, 2007; Harel, 2007; Hershberger & Fisher, 2003). When select- 
ing a program, the imputer should consider the assumptions of the imputation model to 
be applied, including whether values are imputed according to a normal or linear mixed 
model (Harel & Zhou, 2007; Horton & Lipsitz, 2001). For longitudinal data, the impu- 
tation model should be the latter to avoid distortions of the data structure (Black, Harel, 
& McCoach, in press; Schafer, 1997; Yucel & Demirtas, 2010). Available programs for 
MI of longitudinal, or clustered, data include PAN (Schafer & Yucel, 2002) for R (R 
Development Core Team, 2010) and S-Plus (TIBCO Software Inc., 2000-2010) and an 
MI macro (Carpenter & Goldstein, 2004) that operates in MLwiN (Centre for Multilevel 
Modelling, 1982-2010). 

Specifying the imputation model can be complex, and idiosyncrasies of the program 
used for imputation should be well understood. The imputation model must agree with 
the analytical model to the extent that model assumptions are commensurate, and substan- 
tive variables are included in the imputation model to preserve relationships when deriving 
plausible values (Schafer, 2003). Auxiliary variables can also improve imputation (Collins 
et al., 2001). Inclusion of auxiliary variables is more straightforward with MI; these vari- 
ables are included in the imputation model in the same way as other covariates. In analyses 
of the completed datasets, the analyst need only include variables of substantive interest, 
and can avoid cluttering the model with auxiliary variables (see Graham, 2003). 

Multiple imputation uses a Bayesian estimation technique (see Schafer, 1997, for 
technical details). The simulation of missing values is conducted in a three-step sequence. 
In the first step, random values for the level-1 random effects are drawn, given assumed 
values for the MD, fixed effects and covariances of the random effects and residuals. Sec- 
ond, new values for the fixed effects and covariances are randomly drawn given the esti- 
mated random effects and the assumed missing values. Third, plausible values to replace 
the MD are drawn from a distribution given the current values for the random and fixed 
effects, and the relevant covariance matrices. This iterative process results in a probability 
distribution of simulated values for the MD from which random draws are made to create 
m imputed datasets (Schafer, 2001). Higher rates of missing information, highly variable 
estimates of the random coefficients, and large sample sizes can contribute to slow conver- 
gence rates, and hundreds of iterations may be needed for convergence (Schafer, 2001). 


Combining Parameter Estimates and Calculating Standard Errors 


The results of analyses from m multiply imputed datasets are combined according to 
Rubin’s (1987) rules. Parameter estimates are combined by taking the average of all esti- 
mates of the same parameter: 

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m} \hat{Q}_j$$

where $\hat{Q}_j$ represents the estimate of the population value of interest for the jth imputed 
dataset. Standard errors are combined across imputations, accounting for the amount of 
variance within and between imputations. The within-imputation variance ($\bar{U}$) is equal to 
the average of the parameter variances from each of the m imputed datasets ($U_j$): 

$$\bar{U} = \frac{1}{m}\sum_{j=1}^{m} U_j$$

The between-imputation variance, B, is calculated as 

$$B = \frac{1}{m-1}\sum_{j=1}^{m} \left(\hat{Q}_j - \bar{Q}\right)^2$$

The total variance around the parameter, T, is calculated as 

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B$$

and the final standard error estimate is the square root of T (see Schafer, 1997, p. 113). 
The final results of MI ($\bar{Q}$, T) follow a t distribution and can be used for statistical testing 
and creation of confidence intervals. 
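
Rubin's rules translate directly into a short pooling function. In the sketch below the function name `pool_rubin` and the input values are hypothetical. The degrees-of-freedom formula for the t reference distribution is the standard one from Rubin (1987); it is not given in the text above and is included here as an assumption about what a complete implementation would need.

```python
import math

def pool_rubin(estimates, variances):
    """Pool m estimates of one parameter and their squared standard errors."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # pooled estimate
    u_bar = sum(variances) / m                              # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = u_bar + (1 + 1 / m) * b                             # total variance
    # Degrees of freedom for the t reference distribution (Rubin, 1987).
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2 if b > 0 else math.inf
    return q_bar, math.sqrt(t), df

# Hypothetical results for one coefficient from m = 5 imputed datasets.
est = [1.0, 1.2, 0.8, 1.1, 0.9]
var = [0.04, 0.04, 0.04, 0.04, 0.04]
pooled_est, pooled_se, pooled_df = pool_rubin(est, var)
print(pooled_est, pooled_se, pooled_df)
```

The pooled standard error exceeds the within-imputation standard error whenever the estimates differ across datasets, which is how MI reflects uncertainty about the imputed values.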


Multiple Imputation with Missing Level-2 Covariates 


Because MD can occur at both levels, an imputer can implement Markov Chain Monte 
Carlo (MCMC) simulation methods (Schafer, 1997) in one of several ways to deal with 
this complication. Perhaps the simplest solution is to use a combination of procedures 
written for use in R. The imputer can use the PAN function in R to impute level-1 vari- 
ables, in conjunction with an appropriate function for imputing the level-2 variables. By 
conditioning on the level-2 variables, the level-1 variables can be imputed using PAN. 
Similarly, by conditioning on the level-1 variables, the level-2 variables can be imputed 
with an appropriate R imputation function. With continuous outcomes, level-2 variables 
can be imputed using a multivariate normal model; the appropriate function for imputa- 
tion of level-2 variables would be NORM. 

After imputing values for both level-1 and level-2 variables, the parameters of the 
imputation model can be updated. By iterating through this process (impute level-1 vari- 
ables, impute level-2 variables, update parameters) one creates a Gibbs sampler, which (if 
it converges) will converge to a posterior distribution of plausible values from which valid 
imputations can be drawn. Because the technique is complex, it is strongly recommended 
that the imputer have specific expertise in use of these operations. 


Methods for Nonignorable Missingness 


The methods outlined to this point in the chapter are appropriate for ignorable missing- 
ness conditions. When missingness is MNAR, a model for the missingness must be speci- 
fied along with the data model, because the two are dependent. 

Options for modeling data with nonignorable missingness include pattern mixture, 
selection, and shared parameter modeling. Pattern mixture modeling (Little, 1993) allows 
the analysis of longitudinal data with values MNAR by identifying classes of MD pat- 
terns among respondents and including these patterns in the MLE model (Fitzmaurice, 
2003; Little, 1995). With this technique, missingness patterns serve as grouping variables 
that are included as covariates in the longitudinal model. Selection models (Diggle & 
Kenward, 1994) are similar to pattern mixture models in that both a longitudinal model 
and a dropout model are specified and combined. A primary difference, and advantage, 
of selection models is that the data model is estimated directly rather than conditionally 
upon the dropout model. This allows the analyst to focus on estimates of substantive 
interest, and account simultaneously for nonignorable patterns of dropout. Details about 
this technique are available in Fitzmaurice (2003; see also Demirtas & Schafer, 2003; 
Hedeker & Gibbons, 2006). With shared parameter models (SPMs; Follmann & Wu, 
1995; Tsonaka, Verbeke, & Lesaffre, 2009; Wu & Carroll, 1988; Yuan & Little, 2009), 
the assumption is made that missingness is related to the data through a set of random 
effects, such as errors in measurement. By factoring out the underlying random effects, 
the data can be modeled independent of the missingness process. 


Example 


To illustrate the application of selected MD techniques, we began with the subset of diary 
data collected by Conner (2009), described earlier. The selected variables in the dataset 
were Mood, a continuous, normally distributed outcome variable that was measured 
daily; Day in the Study, centered at Day 1 (Day_C); Weekend, a dichotomous indicator 
of weekend day; Headache, rated 0-4 for severity; Stress, rated 0-4 for severity; and 
NDrinks, representing the number of drinks the previous night. The data included obser- 
vations for 263 participants, with a range of 11 to 21 days of completed diaries. For the 
sake of our example, this dataset served as the completely observed data. 

To impose a general pattern of missing values under the condition of MAR, we 
specified logistic regression equations with missingness (1/0) on two variables modeled as 
a function of other variables in the dataset. 

For the Mood variable, 30% of observations were removed, conditional upon values 
of Day_C and Weekend, to create the MAR condition that missingness was more likely 
for later observations and on weekends. Higher rates of missingness were made more 
likely for participants with higher levels of self-reported stress. The models for
missingness were specified as follows: 


High-rate MD: $\operatorname{logit}(p_{M,\mathrm{Mood}}) = -1 + 1.06 \times \mathrm{Weekend} + 0.08 \times \mathrm{Day\_C}$ 
Low-rate MD: $\operatorname{logit}(p_{M,\mathrm{Mood}}) = -2.75 + 1.06 \times \mathrm{Weekend} + 0.08 \times \mathrm{Day\_C}$ 


where $\operatorname{logit}(p) = \log[p/(1 - p)]$, and $p_{M,\mathrm{Mood}}$ is the probability that Mood is missing. 

We removed 40% of observations for Headache based on missingness for Mood and 
additionally on values of Stress to create the conditions that (1) when Mood was missing, 
all data were missing and (2) for participants with higher levels of stress, the Headache 
item was more likely to be skipped. The model for missingness based on Stress was 


$\operatorname{logit}(p_{M,\mathrm{Headache}}) = -2.21 + 0.314 \times \mathrm{Stress}$ 


where $p_{M,\mathrm{Headache}}$ is the probability that Headache is missing. 
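
To see what these logistic models imply, the logits can be converted to probabilities with the inverse logit, p = 1/(1 + exp(-logit)). The sketch below uses the coefficients reported for the Mood and Headache missingness models. The Headache intercept is read as -2.21 from a partly illegible source line, so treat it as an assumption, and the function names are hypothetical.

```python
import math

def inv_logit(x):
    """Convert a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def p_mood_missing(weekend, day_c, high_rate):
    """Probability that Mood is missing under the high- or low-rate model."""
    intercept = -1.0 if high_rate else -2.75
    return inv_logit(intercept + 1.06 * weekend + 0.08 * day_c)

def p_headache_missing(stress):
    """Probability that Headache is skipped, given the Stress rating (0-4)."""
    return inv_logit(-2.21 + 0.314 * stress)

# Compare a weekday at Day_C = 0 with a weekend day at Day_C = 20.
for high_rate in (True, False):
    first = p_mood_missing(weekend=0, day_c=0, high_rate=high_rate)
    late = p_mood_missing(weekend=1, day_c=20, high_rate=high_rate)
    print(high_rate, round(first, 3), round(late, 3))

print([round(p_headache_missing(s), 3) for s in range(5)])
```

Both Mood models share the same slopes and differ only in the intercept, so the high-rate model yields a higher probability of missingness at every combination of Weekend and Day_C.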

Given the arbitrary pattern of missingness, 260/263 (99%) of cases had at least one 
missing observation. 

We modeled the incomplete data using four MD options: case deletion, LOCF, MLE, 
and MI. For the case deletion, we removed cases with > 25% of observations missing. 
This resulted in n = 135 cases with at least 75% of scheduled observations completed. 
In the LOCF condition, missing values were replaced with the last observed value. MLE 
was conducted in R (R Development Core Team, 2010), using the linear mixed effects 
(lme) routine. MI was conducted using the PAN routine in R (Schafer & Yucel, 2002). 
We calculated and combined estimates from 100 imputation-completed datasets (statisti- 
cal code can be requested from the second author). The imputation model included all 
variables in the substantive model. 

The substantive model estimated Mood as a function of Day_C, Stress, Headache, 
Weekend, and NDrinks, allowing the intercept and the slopes of all predictors except 
Weekend to vary randomly: 


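Level-1 model: $\mathrm{Mood}_{ti} = \pi_{0i} + \pi_{1i}(\mathrm{Headache}_{ti}) + \pi_{2i}(\mathrm{Stress}_{ti}) + \pi_{3i}(\mathrm{Day\_C}_{ti}) + \pi_{4i}(\mathrm{NDrinks}_{ti}) + \pi_{5i}(\mathrm{Weekend}_{ti}) + e_{ti}$ 


Level-2 models: $\pi_{0i} = \beta_{00} + r_{0i}$; $\pi_{1i} = \beta_{10} + r_{1i}$; $\pi_{2i} = \beta_{20} + r_{2i}$; $\pi_{3i} = \beta_{30} + r_{3i}$; $\pi_{4i} = \beta_{40} + r_{4i}$; $\pi_{5i} = \beta_{50}$ 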


The results for each of these MD options compared to the complete dataset are 
shown in Table 19.1. As expected, model fixed effects estimates and standard errors 
obtained with ML and MI were substantially more representative of complete data esti- 
mates than those obtained with case deletion or LOCF. In the case deletion illustration, 
standard errors (SEs) for two of six fixed effects (Headache and Weekend) were inflated 
by > 50% of the complete data values, and three parameter estimates ($\beta_{10}$, $\beta_{40}$, and $\beta_{50}$; 
the effects of Headache, NDrinks, and Weekend, respectively) were biased by more than 
one standard error. Three of six fixed-effects estimates were substantially biased in the 
LOCF illustration ($\beta_{00}$, $\beta_{20}$, and $\beta_{50}$; the level-2 intercept, and effects of Stress and
Weekend, respectively), although SEs were accurate, with none inflated by 50% or more.
However, the best estimates of fixed effects were obtained with MI and ML, with only one 
estimate ($\beta_{10}$, the effect of Headache) biased by more than one SE in each condition, and 
no SE inflated by > 50% of complete data values. In this example, because the estimates 
of the random effects are so small, the impact of the missing values is minimal, and the 
principled MD techniques did not provide a particular advantage for random effects 
estimates. However, over replications, these techniques have demonstrated better perfor- 
mance than ad hoc routines. 


Summary 


Under appropriate assumptions and model specification, properly selected and imple- 
mented MD techniques can preserve the distributional characteristics of variables of 
interest, and provide valid inferences about population parameters. The selection of tech- 
niques should be informed by careful analysis of the nature of missing observations. 
Analysts should test the tenability of the MCAR assumption, and thoughtfully consider 
the plausibility of the ignorable missingness assumption. Anticipating and planning for 
missing observations can improve the quality of analyses; measures can be taken to mini- 
mize nonresponse, and variables related to nonresponse can be measured and modeled. 

The principled methods described are complex and do not perform optimally under 
all conditions. The analyst should fully understand the assumptions and requirements of 
a technique before applying it. Alternatively, enlisting the help of a statistician versed in 
the techniques may be well worth the investment. 

We implore researchers to disclose rates of MD, MD assumptions, and the methods 
used to address them in published work. This practice will promote the application of 
proper techniques, and a greater understanding of the methodological and statistical 
issues involved in handling incomplete data. 


Recommendations 


With regard to MD, the best thing to do is to avoid it. The second best thing is to under- 
stand it, plan for it, and address it with appropriate modeling techniques. 


1. Plan for missingness. When planning an intensive longitudinal study, researchers 
should anticipate inevitable MD, and review typical causes for the population of interest 
given the nature of the study. Variables determined to relate to nonresponse should be 
identified and measured. 


2. Minimize nonresponse. Researchers can minimize missed observations by incor- 
porating procedures into the study plan to reduce missed assessments and ensure regular 
review of data. 


3. Determine the mechanism of missingness. Analysts should test the assumption of 
MCAR and carefully consider the plausibility of ignorable missingness. 


4. Apply appropriate techniques. Principled techniques are effective when applied 
appropriately under proper assumptions but provide misleading results when implemented 
incorrectly. Specifying the models properly and carefully considering assumptions about 
the nature of missingness help to ensure meaningful and valid results. 


5. Report missingness and techniques used. As with all other aspects of data- 
analytic methods, researchers should fully describe MD methods and the occurrence of 
incomplete data, MD assumptions, and the techniques selected to handle them. 

Notes


1. For example, the PAN program that operates in S-Plus or R requires that all incomplete variables 
be specified as response variables in the imputation model, whether in the analytical model they are 
level-1 covariates or response variables. 

2. Carpenter and Goldstein (2004) describe a macro written for MLwiN that automates multiple impu- 
tation for multilevel data, whereby variables with MD are temporarily specified as outcome variables 
for imputation. Readers are referred to www.missingdata.org.uk for more information. 

CHAPTER 20


Multilevel Modeling Analyses 
of Diary-Style Data 


JOHN B. NEZLEK 


Researchers who use ambulatory assessment methods and various types of diaries are
increasingly (almost invariably) using some type of multilevel technique to analyze 
their data. This reflects the fact that the data collected in such studies are inherently mul- 
tilevel. A sample of individuals provides data on a repeated basis, creating a multilevel 
data structure in which people constitute one level of analysis and the repeated measures 
they provide constitute another level, or levels, of analysis. In this chapter I discuss mul- 
tilevel random coefficient modeling (MLM), the technique that is currently thought to be 
the best way to analyze such multilevel data structures. I introduce MLM and present 
ways of using MLM that are well suited for analyzing data generated in ambulatory 
assessment and diary studies. 

When writing this chapter, I tried to address the needs of two audiences: researchers
who are quite familiar with using MLM to analyze data collected using ambulatory
assessment and other intensive repeated measures, and scholars (both new and established)
who are not at all familiar with such applications. To address the needs of these different 
audiences I have provided sufficient introductory material to allow those who are unfa- 
miliar to understand the basic principles involved, while describing more sophisticated 
applications for the benefit of those who are already familiar. Consequently, those who 
are familiar with MLM analyses of diary data may wish to skip or skim introductory 
sections and focus on sections dealing with specific topics or applications. 

For those who are not familiar with MLM and want to know more about the tech- 
nique, I recommend the following. For introductions to MLM per se, Raudenbush and 
Bryk (2002), Kreft and de Leeuw (1998), Snijders and Bosker (1999), and Hox (2002) 
are all accessible to the nonexpert (albeit with some extra effort to get through some sec- 
tions). In terms of using MLM to analyze diary-style data per se, I have written a few 
articles that may be helpful (Nezlek, 2001, 2003, 2007a, 2007b). Although there is some 
overlap among them, the articles vary in terms of focus and detail. I also recently pub- 
lished a book that should be helpful (Nezlek, 2011). 

Consistent with the focus of this handbook, I discuss MLM in terms of the types of
multilevel data that are frequently collected in ambulatory assessment and diary research, 
although much of what I discuss can be applied to other types of data. Moreover, to 
illustrate certain points, I often refer to my own research. I have done this not because
I am the only person who has used MLM to analyze these types of data (quite the
opposite; there are many experienced and knowledgeable scholars who have published
MLM-based studies in this area) but because I am more familiar with my own studies than I
am with the work of others.


Conceptual and Statistical Background 


Multilevel data structures are sometimes referred to as nested data, because observa- 
tions at one level of analysis are nested within observations at another level of analysis. 
Whether the data are what Wheeler and Reis (1991) described as interval contingent 
(e.g., a daily diary study), signal contingent (e.g., a beeper study), or event contingent 
(e.g., a social interaction diary study) does not matter for present purposes. A study in 
which people provide the same data each day for a period of time would be described as 
days nested within persons. If multiple assessments are collected each day, it may be
useful to think of such data in terms of three levels: assessments nested within days, and
days nested within persons. If people describe the social interactions they had over a
period of time, such data would be described as interactions nested within persons.

Inferential statistics provide a basis to make inferences about the characteristics 
(parameters) of the population from which a sample of observations has been drawn. It is 
critical to recognize that a multilevel data structure is created by drawing samples from 
different populations, and so there are two (or more) targets of inference. In a daily diary 
study, one sample is people and the other is days. In a social interaction diary study, one 
sample is people, and the other is interactions. In a daily diary study, researchers may 
want to draw inferences about day-level relationships such as that between daily stress 
and daily mood. They may also want to draw inferences to the population of people 
about how mean levels of daily observations (e.g., mood) and within-person relationships 
between daily observations (e.g., mood-stress) vary across persons. 

Such multiple sampling means that the sampling error associated with drawing sam- 
ples at each level of analysis needs to be taken into account. For example, a coefficient cal- 
culated over a specific 2-week period representing a within-person relationship between 
two daily measures is, in fact, sampled from a population of such coefficients. For the 
same person, a coefficient based on data collected during the first 2 weeks of July will 
not be exactly the same as a coefficient based on data collected during the first 2 weeks 
of August, even if that person’s life has not changed. When making an inference to the 
population of days, there is some error in an estimate of a population parameter based 
on a sample of days, just as there is some error associated with estimates of population 
parameters based on a sample of persons. 

A critical shortcoming of ordinary least squares (OLS) regression analyses of the 
types of nested data typically collected in diary-style studies is their inability to take into 
account simultaneously the error (variance) associated with the sampling of observa- 
tions at multiple levels of analysis. For example, if within-person relationships between 
variables were represented with within-subject correlations, and these correlations were
then used as dependent measures in another analysis, the random error associated with 
these correlations would not be part of the analysis. Single-level OLS regression analyses 
in which person-level variables are assigned to daily measures are also flawed for similar 
reasons, no matter how elaborate the correction or adjustment. These and other short- 
comings are described in more detail in Nezlek (2001, 2003, 2007b, 2011). 

At this point it suffices to note that the type of multilevel analyses described in this 
chapter provides the most accurate estimates of parameters for common types of mul- 
tilevel data structures in ambulatory assessment research. Accuracy in this instance is 
defined in terms of the results of Monte Carlo studies, in which samples are drawn from a 
population with known parameters, and the correspondences between these parameters 
and the estimates produced by different techniques are analyzed. 


Basic Models in the Analyses of Diary-Style Data 


Within the nomenclature of MLM, the levels in a multilevel dataset are labeled level 1, 
level 2, and so forth, with lower numbers nested within higher numbers. For a study in 
which participants provided data each day for some period of time, daily observations 
would constitute the level-1 data, and data describing the participants would constitute 
the level-2 data. Note that the number of level-1 observations does not have to be the 
same for all level-2 units. People can provide different numbers of days. The minimum 
number of level-1 observations a level-2 unit should have is two. If there is only one 
level-1 observation for a level-2 unit, the variance between the two levels cannot be dis- 
tinguished. In a diary study, if people provided only 1 day of data, it would not be possi- 
ble to distinguish within- and between-person variance (level 1 and level 2, respectively), 
because there is no way to estimate the within-person variance. How to conceptualize the 
levels in an analysis is discussed in more detail later. In this chapter, I focus on two-level 
models, although the principles and techniques I describe can be applied to models with 
more than two levels. 

Within MLM, level-2 units are often referred to as groups, even though they may not 
be actual groups. For example, in a diary study in which days are nested within persons, 
persons are typically described as a grouping variable. This convention may be confus- 
ing initially, but it is well established and unlikely to change. When discussing MLM, I 
rely on the explanatory framework developed by Bryk and Raudenbush (1992). In this 
framework, they present separate equations for each level of a model. For a diary study, 
there would be a day-level equation (level 1) and a person-level (level 2) equation. Within 
this framework, coefficients at lower levels of analysis are “brought up” to higher levels 
of analysis. Although all coefficients in a MLM are estimated simultaneously, I think 
separating the equations for each level of analysis clarifies what is being done. Moreover, 
I think presenting MLM in this way is particularly helpful for readers who may not be 
familiar with MLM. 

The basic two-level model is presented below. In this model i observations of a
continuous measure y are nested within j level-2 units. Although conventions vary
somewhat, I describe models using β (beta) for level-1 coefficients and γ (gamma) for level-2
coefficients.

$y_{ij} = \beta_{0j} + r_{ij}$

$\beta_{0j} = \gamma_{00} + u_{0j}$


Such a model is referred to as an unconditional or null model, because there are no
predictors at any level. Level-1 observations for each level-2 unit are modeled as a
function of the mean for that level-2 unit ($\beta_{0j}$), and the level-1 variance is the variance of the
deviations of each score from that mean ($r_{ij}$). In turn, the means for each level-2 unit are
modeled as a function of the grand mean (the mean of the means, $\gamma_{00}$), and the level-2
variance is the variance of the deviations of each mean from the grand mean ($u_{0j}$).

If daily observations of mood were nested within persons, the level-1 equation would
be the within-person (day-level) model, $\beta_{0j}$ would be the mean mood for each person
collapsed across days, and the level-1 variance would be the day-level (within-person)
variance. Correspondingly, the level-2 equation would be the between-person model, $\gamma_{00}$
would be the grand mean for mood, and the level-2 variance would be between-person
variance.
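
To make the terms of the unconditional model concrete, the following sketch decomposes a small set of invented mood ratings into a grand mean, person-level deviations, and day-level deviations. It is only an illustration of what the terms represent: an MLM program estimates these quantities simultaneously by maximum likelihood, whereas this sketch uses simple means, and the data and variable names are hypothetical.

```python
from statistics import mean

# Hypothetical daily mood ratings for three people (balanced: four days each).
mood = {
    "p1": [4.0, 5.0, 6.0, 5.0],
    "p2": [2.0, 3.0, 2.0, 1.0],
    "p3": [7.0, 6.0, 8.0, 7.0],
}

# Level 1: each person's mean across days (the role played by beta_0j).
person_means = {pid: mean(days) for pid, days in mood.items()}

# Level 2: the mean of the person means (the role played by gamma_00).
grand_mean = mean(person_means.values())

# Person-level deviations from the grand mean (the role played by u_0j).
u = {pid: m - grand_mean for pid, m in person_means.items()}

# Day-level deviations from each person's own mean (the role played by r_ij).
r = {pid: [y - person_means[pid] for y in days] for pid, days in mood.items()}

for pid in mood:
    print(pid, person_means[pid], round(u[pid], 3), r[pid])
```

Every observation can be rebuilt as the grand mean plus the person's deviation plus the day's deviation, which is exactly what the two equations state when the level-2 equation is substituted into the level-1 equation.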

To examine level-2 differences in level-1 intercepts (e.g., relationships between
Neuroticism and mean daily Mood) a predictor could be added at level 2 (person-level), as
shown below. Such an analysis is functionally equivalent to calculating a mean score
for Mood for each person, then correlating this mean with Neuroticism scores at the
between-person level. If the $\gamma_{01}$ coefficient is significantly different from 0, then the
relationship between mean daily Mood and Neuroticism is statistically significant.


$y_{ij} = \beta_{0j} + r_{ij}$


$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Neuroticism}) + u_{0j}$


To examine level-1 relationships (e.g., within-person relationships between Stress 
and daily Mood) predictors can be added at level 1 (the day-level), as shown below. 


$y_{ij} = \beta_{0j} + \beta_{1j}(\text{Stress}) + r_{ij}$

$\beta_{0j} = \gamma_{00} + u_{0j}$

$\beta_{1j} = \gamma_{10} + u_{1j}$


Note that for each level-2 unit (e.g., each person) an intercept ($\beta_{0j}$) and a slope ($\beta_{1j}$,
representing the relationship between the predictor and the outcome, Stress and Mood)
are estimated, and these level-1 coefficients are then analyzed at level 2. Such an analysis
is roughly equivalent to calculating a regression equation for each person, then analyzing
these coefficients at the person level. The hypothesis that the mean intercept is different
from 0 is tested with the $\gamma_{00}$ coefficient, and the hypothesis that the mean slope (the
relationship between Stress and Mood) is different from 0 is tested with the $\gamma_{10}$
coefficient. Note that in this model there are separate error terms for the intercept and the
slope, a topic discussed in the next section.
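
The "rough equivalence" to person-by-person regressions can be sketched directly. The code below fits a separate least-squares line for each person and then averages the intercepts and slopes. This is the two-stage approximation, not MLM itself: it ignores the sampling error in each person's coefficients, which is the shortcoming MLM addresses. The data and the helper `ols_line` are hypothetical.

```python
from statistics import mean


def ols_line(x, y):
    """Return (intercept, slope) of a simple least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical diaries: (daily stress, daily mood) for three people.
diaries = {
    "p1": ([1, 2, 3, 4], [6.0, 5.0, 4.0, 3.0]),
    "p2": ([1, 2, 3, 4], [5.0, 4.5, 4.0, 3.5]),
    "p3": ([2, 3, 4, 5], [4.0, 4.0, 4.0, 4.0]),
}

# Stage 1: one intercept and one slope per person (the roles of beta_0j, beta_1j).
coefs = {pid: ols_line(x, y) for pid, (x, y) in diaries.items()}

# Stage 2: average the person-level coefficients (the roles of gamma_00, gamma_10).
mean_intercept = mean(b0 for b0, _ in coefs.values())
mean_slope = mean(b1 for _, b1 in coefs.values())

print(coefs)
print(mean_intercept, mean_slope)
```

In a real analysis the person-level coefficients would be weighted by their reliability, so MLM estimates will generally differ from these simple averages, particularly when people provide different numbers of days.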

It is also possible to examine how level-1 slopes vary as a function of level-2 measures. 
Such an analysis is sometimes referred to as a “slopes as outcomes” analysis because a 
slope from level 1 becomes an outcome at level 2. When a level-1 slope varies as a
function of a level-2 measure, this is sometimes called a “cross-level interaction,” or a
moderated relationship because a relationship at one level of analysis is varying as a function of
a variable at another level. 

The model below examines whether the Mood-Stress slope at level 1 varies as a 
function of Neuroticism at level 2. Note that Neuroticism is included in the level-2 equa- 
tions for both the intercept and the slope. Even if there is no hypothesis concerning the 
intercept, it is best to include the same predictors in all level-2 equations (at least initially). 
This recommendation reflects the fact that estimating coefficients in MLM is based on 
covariance matrices, and if a level-2 predictor is not included in the equation for a level-1 
coefficient, then it is assumed that the level-2 predictor is not related to that particular 
level-1 coefficient. If this assumption is incorrect (i.e., if the predictor is in fact related), 
and the predictor is not included in the model, the model may be misspecified. 


$y_{ij} = \beta_{0j} + \beta_{1j}(\text{Stress}) + r_{ij}$


$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Neuroticism}) + u_{0j}$

$\beta_{1j} = \gamma_{10} + \gamma_{11}(\text{Neuroticism}) + u_{1j}$
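
Substituting the two level-2 equations into the level-1 equation shows why this is called a cross-level interaction. The combined form below is a standard algebraic rearrangement rather than an equation given in the chapter, and the subscripts on Stress and Neuroticism are added here only to mark which level each variable belongs to.

```latex
\begin{aligned}
y_{ij} &= \beta_{0j} + \beta_{1j}\,\text{Stress}_{ij} + r_{ij} \\
       &= \gamma_{00} + \gamma_{01}\,\text{Neuroticism}_{j}
          + \gamma_{10}\,\text{Stress}_{ij}
          + \gamma_{11}\,(\text{Neuroticism}_{j} \times \text{Stress}_{ij}) \\
       &\quad + u_{0j} + u_{1j}\,\text{Stress}_{ij} + r_{ij}
\end{aligned}
```

The $\gamma_{11}$ term multiplies a person-level variable by a day-level variable, so a significant $\gamma_{11}$ indicates that the within-person Mood-Stress relationship differs as a function of Neuroticism.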


Random Error and Modeling the Variability of Coefficients 


One of the advantages of MLM analyses compared to OLS techniques is the ability to 
model multiple error terms simultaneously. In the previous example, the intercept and 
slope each had an error term. When a level-1 coefficient is modeled with an error term at 
level 2, it is described as randomly varying. When a level-1 coefficient is modeled without 
a random error term, it is described as fixed (not to be confused with the estimate of the 
fixed effect, i.e., the coefficient itself). Finally, if a level-1 coefficient is modeled without a 
random error term but differences in the coefficient are modeled as a function of a vari- 
able at level 2, it is described as nonrandomly varying. 

Some guidelines for modeling the error structure of a model are presented in the 
section below on model building. My sense is that most, if not virtually all, coefficients 
in diary-style research should be modeled as randomly varying if the random effects can 
be estimated reliably. I have read articles in which authors have not modeled coefficients 
as random (a term used instead of the more cumbersome randomly varying), because 
they believed there was no compelling theoretical reason to assume a coefficient (e.g., a 
level-1 slope) should vary across level-2 units. While there may not be a reason to expect 
a coefficient to vary, such an expectation on its own is a poor justification for assuming 
that it does not vary, particularly when such an assumption can be tested. If, as discussed 
below, the data do not provide a basis to estimate such variability, fine, do not model it. 
On the other hand, if the data provide a basis for such an estimate, it seems inappropriate 
to ignore the reality suggested by the data, irrespective of what some theoretical position 
might lead one to assume. 

Regardless, it is possible to model level-2 differences in a level-1 coefficient even 
when the random error term associated with such a coefficient is not significant. Some 
might argue that the lack of a significant random error term for a coefficient means there 
is no variability and there is therefore nothing to model. Although such an approach is 
understandable, the lack of a significant random error term simply means that the data 
cannot estimate the random error—that is, the data do not provide a sufficient basis for 
distinguishing fixed (true) and random variation. The lack of a significant random error 
term does not mean there is no variability: It means that the variability cannot be modeled
given the data at hand. When a level-2 predictor is included in a model (when a coefficient 
is modeled as nonrandomly varying), more information is introduced into the model, and 
this information may provide a basis to model the variability in a coefficient. 

Finally, it is important to keep in mind that it may be useful to model the variability 
in a coefficient even if the fixed effect is not significantly different from 0. For example, 
in the case of a slope, it is possible that the slopes for half the level-2 units (e.g., people) 
are positive, and half are negative, resulting in a mean of 0. If this is the case, it may be 
possible to find a level-2 variable that can model such differences. 
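
A toy illustration of this point, using invented numbers: the slopes below average to 0 across people, yet they differ sharply once a (hypothetical) person-level variable is taken into account. The values stand in for within-person slopes; they are not estimates from any study.

```python
from statistics import mean

# Hypothetical within-person slopes and a hypothetical dichotomous level-2 variable.
slopes = [0.5, 0.4, 0.6, -0.5, -0.4, -0.6]
trait = [1, 1, 1, 0, 0, 0]

overall = mean(slopes)
high = mean(s for s, t in zip(slopes, trait) if t == 1)
low = mean(s for s, t in zip(slopes, trait) if t == 0)

print(overall, high, low)
```

A test of the mean slope alone would suggest that there is no relationship, whereas modeling the slope as a function of the level-2 variable would recover the difference between the two groups of people.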


Model Building 


Although there are no absolute rules about how to build models in MLM analyses, there 
are guidelines. Analysts should first run unconditional models, described previously, for 
all the measures in a study. Such models provide the basic descriptive statistics of MLM 
analyses—estimates of means and variances at each level of analysis. Moreover, these 
estimates of the distribution of variances can provide some insight into the levels of analy- 
sis that might be the most productive. If there is little variance in a measure at a certain 
level of analysis, it may be difficult to model that variance. For example, in Nezlek, 
Kafetsios, and Smith (2008), a study in which social interactions were nested within per- 
sons, we were able to model person-level (level 2) differences in the positive affect people 
experienced in their interactions, but we were not able to model person-level differences 
in the negative affect they experienced. One reason for this difference may have been the 
relatively small amount of between-person variance in our measures of negative affect. 
That said, even when there is relatively little variance at a level of analysis, it may still
be possible to model that variance. A small amount of variance at a level of analysis does
not indicate that the variance cannot be modeled: It simply means that it may be difficult to
model. Finally, the variance estimates provided by unconditional models can be used to 
calculate intraclass correlations (ICCs). 
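
For a two-level unconditional model, the ICC is conventionally the level-2 (between-person) variance divided by the sum of the level-1 and level-2 variances. The sketch below applies that standard formula; the function name and the variance values are hypothetical and are used only for illustration.

```python
def intraclass_correlation(between_var, within_var):
    """ICC for a two-level unconditional model: share of total variance at level 2."""
    total = between_var + within_var
    if total <= 0:
        raise ValueError("total variance must be positive")
    return between_var / total


# Hypothetical variance estimates from an unconditional model of daily mood.
icc = intraclass_correlation(between_var=0.8, within_var=1.2)
print(round(icc, 2))
```

With these invented values, 40% of the total variance in daily mood would be between persons and 60% within persons.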

How to add predictors is another important aspect of model building, and there are 
two widely recognized guidelines for this. First, level-1 models should be finalized before 
level-2 predictors are added. Finalization includes specifying the predictors and the error 
structure. Second, in terms of adding predictors, forward stepping rather than backward 
stepping procedures are preferred. Unlike many OLS regression analyses in which all pos- 
sible predictors are added and those that are not significant are dropped from the model, 
within the multilevel context, it is advisable to enter predictors singly (or a few at a time), 
evaluate their contribution to the model, then decide to retain or drop them—building a 
model rather than tearing one down. 

In MLM, forward stepping procedures are preferred (particularly at level 1) because 
of the number of parameters estimated in a model. As predictors are added, the number 
of parameters that are estimated may tax what is sometimes called the carrying capacity 
of a dataset, that is, the number of parameters a dataset can estimate. For a level-1 model with a
single predictor, six parameters are estimated: the level-1 variance (1), the fixed and random
terms for the intercept (2), the fixed and random terms for the slope (2), and the
covariance between the two random error terms (1). When there are two level-1 predictors, 10
parameters are estimated: the level-1 variance (1), a fixed and a random term for the intercept
and for each of the two predictors (6), and the covariances among the three random
terms (3). When there are three predictors, 15 parameters are estimated: the level-1
variance (1), a fixed and a random term for the intercept and for each of the three
predictors (8), and the covariances among the four random terms (6). In comparison, in OLS
regression, only one error term is estimated, and only one parameter is estimated for each predictor.
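
The counting rule behind these numbers can be written out. The sketch below assumes that every level-1 coefficient is modeled as random and that all covariances among the random terms are estimated; the function name is hypothetical. Note that for three predictors, adding the level-1 variance to the 8 fixed and random terms and the 6 covariances gives 15.

```python
def n_parameters(n_level1_predictors):
    """Parameters in a two-level model with all level-1 coefficients random
    and an unstructured covariance matrix for the random terms."""
    k = n_level1_predictors + 1        # random coefficients: intercept plus slopes
    fixed = k                          # one fixed effect per coefficient
    random_variances = k               # one random-effect variance per coefficient
    covariances = k * (k - 1) // 2     # every pair of random terms
    level1_variance = 1
    return fixed + random_variances + covariances + level1_variance


counts = [n_parameters(p) for p in range(4)]
print(counts)
```

Because the number of covariances grows with the square of the number of random terms, each added level-1 predictor costs more parameters than the one before it, which is the practical argument for adding predictors a few at a time.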

The norm within the multilevel context emphasizes tighter, more parsimonious mod- 
els that include only variables that have explanatory power over models with many pre- 
dictors that provide less precise estimates of individual coefficients. Of course, analysts 
will need to take into account the norms of their home disciplines regarding the type of 
models considered to be appropriate. Nevertheless, to the extent possible, they should be 
guided by the law of parsimony—in this regard, less is truly more. 

The final consideration I discuss is modeling error. Unless there are compelling rea- 
sons to think otherwise, I recommend that coefficients in diary-style research should 
be modeled as random. As discussed earlier, occasions of measurement are invariably 
sampled from some population of occasions, and this sampling error needs to be taken 
into account. Nevertheless, it may not be possible to model reliably the random error for 
all the coefficients in a model, raising questions about how to treat nonsignificant random 
error terms. There is broad agreement, although not a true consensus, that nonsignificant 
random error terms should be deleted from models. 

Following such a guideline requires significance tests of random error terms. There 
seems to be some agreement that a more liberal p level than .05 should be used for mak- 
ing decisions about retaining random error terms—something like p < .10 as a cutoff, 
with anything greater than .15 or .20 being grounds for dropping a term. What about 
values in the .10 to .15 range? In such cases, I recommend running the analysis with and 
without such terms and seeing what impact the inclusion or exclusion has on the fixed 
effects, which are the focus of most hypotheses. If excluding the terms makes no differ- 
ence, then they should be dropped. If their presence changes the fixed effects, why this 
occurred should be considered. Having a clear standard for including or excluding ran- 
dom error terms provides a context for modeling error structures properly. To facilitate 
this process, I recommend evaluating error structures before examining fixed effects. 
If this is done, decisions about error terms will be made without regard to the impact 
changes in the error structure may have on significance tests of the fixed effects. 
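
The decision rule described above can be sketched as a small helper. The default thresholds follow the values mentioned in the text (.10 and .15), but they are judgment calls rather than fixed standards, and the function name and return labels are hypothetical.

```python
def random_term_decision(p_value, keep_below=0.10, drop_above=0.15):
    """Suggest how to treat a random error term given its significance test."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < keep_below:
        return "retain"
    if p_value > drop_above:
        return "drop"
    # Gray zone: run the model both ways and compare the fixed effects.
    return "compare models with and without the term"


for p in (0.04, 0.12, 0.30):
    print(p, random_term_decision(p))
```

An analyst who prefers .20 as the upper bound could pass `drop_above=0.20`; the point is to fix the rule before looking at the fixed effects.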


Centering 


A critical aspect of multilevel modeling is the centering of predictors. Centering refers to 
the reference value around which the deviations of a predictor are taken, and there are 
different centering options available for different levels of analysis. It is critical to keep in 
mind that centering determines what intercepts represent. In this section, I provide a brief 
discussion, with some basic recommendations, although as noted by Bryk and Rauden- 
bush (1992, p. 27), “No single rule covers all cases,” and analysts need to make choices 
based on their specific needs. See Enders and Tofighi (2007) for a thorough and informed 
discussion of centering options in MLM. 

At the top level of a model (level 2 in a two-level model, level 3 in a three-level model, 
etc.), there are two options. Predictors can be entered either grand mean-centered or 
uncentered (sometimes referred to as zero-centered). The grand mean is the mean of
a variable for the entire sample. When a predictor is entered grand mean-centered, the
slope is calculated based on deviations from the grand mean, and the intercept represents 
the expected outcome for a unit (e.g., a person) that is at the grand mean of the predictor. 
This is the same as in OLS regression. 

When a predictor is entered uncentered, the slope is calculated based on deviations 
from 0, and the intercept represents the expected outcome for a unit that has a value of 
0 on the predictor. If gender was represented with a dummy-code (0 = male, 1 = female), 
and gender was entered uncentered, then the intercept would represent the expected value 
when gender = 0 (i.e., for a man). Although some programs allow zero-centered OLS 
regression, such use is not that common. 

Selecting a centering option at this level is fairly straightforward. Continuous predic- 
tors are usually entered grand mean-centered, and categorical predictors can be entered 
either way. There is some sense among modelers that it is better to enter categorical 
predictors uncentered because this makes the meaning of the intercept clearer. More- 
over, continuous predictors can be entered uncentered if 0 is a meaningful value for the 
predictor. If 0 is not meaningful (e.g., a predictor is on a 10- to 100-point scale), it makes 
little sense to enter a predictor uncentered and estimate an intercept that represents an 
expected value of an outcome when the predictor cannot be 0. 
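
Because centering at the top level works as it does in OLS regression, a single-level example is enough to show the effect on the intercept. The sketch below uses invented person-level data in which the predictor cannot be 0; the helper `ols_line` and all values are hypothetical.

```python
from statistics import mean


def ols_line(x, y):
    """Return (intercept, slope) of a simple least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


# Hypothetical person-level data; the predictor's scale does not include 0.
neuroticism = [10.0, 20.0, 30.0, 40.0]
mean_mood = [6.0, 5.0, 5.0, 4.0]

# Uncentered: the intercept is the expected mood when Neuroticism = 0.
b0_raw, b1_raw = ols_line(neuroticism, mean_mood)

# Grand mean-centered: the intercept is the expected mood at the average Neuroticism.
centered = [x - mean(neuroticism) for x in neuroticism]
b0_centered, b1_centered = ols_line(centered, mean_mood)

print(b0_raw, b1_raw)
print(b0_centered, b1_centered)
```

The slope is identical in both versions. Only the intercept changes: uncentered, it describes a person with an impossible score of 0, whereas centered it describes a person at the sample mean.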

At other levels of analysis (e.g., level 1 in a two-level model), predictors can be entered 
uncentered, grand mean-centered, or group mean-centered. Moreover, how predictors 
are centered at lower levels of a model typically has more of an influence on the results 
of an analysis than how predictors are centered at the top level. This is because coef- 
ficients are passed “up” from lower levels to be analyzed at higher levels, and centering 
changes what these coefficients represent. I discuss centering options in terms of level 1 
of a two-level model, but the same principles apply to lower levels per se (e.g., level 2 of 
a three-level model). 

Similar to centering predictors at the top level, when a level-1 predictor is entered 
uncentered, the intercept (which is now the intercept for a group of observations) rep- 
resents the expected outcome for an observation in a group that has a value of 0 on a 
predictor. So, if we had a diary study in which days were nested within persons and a
variable DAYS was coded weekend = 0, weekday = 1, and DAYS was entered uncentered,
then the intercept for each person would represent the expected score for weekend days 
(i.e., when the predictor DAYS = 0). 

When a level-1 predictor is grand mean-centered, the intercept represents the expected 
outcome for an observation in a group that is at the grand mean of a predictor (i.e., the 
mean of all observations in the sample). One result of grand mean-centering level-1
predictors is that level-1 intercepts are adjusted for level-2 differences in predictors.
For example, in a social interaction study, a hypothesis may concern the relationship
between Neuroticism and the Intimacy people find in their interactions, and the general
structure of the analyses is that interactions are nested within persons, as below.


$y_{ij} = \beta_{0j} + r_{ij}$


$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Neuroticism}) + u_{0j}$


If a variable representing whether an interaction was with a close friend was entered 
grand mean-centered at level 1, the resulting intercept would represent the mean
intimacy people found in their interactions, adjusted for individual differences in the percent
of interactions that involved a close friend. 


$y_{ij} = \beta_{0j} + \beta_{1j}(\text{Close Friend}) + r_{ij}$


If Neuroticism was also related to the percent of interactions people have with close
friends (let’s assume negatively), it is possible that the relationship tested by the $\gamma_{01}$ coefficient
in the first model is confounded. If more neurotic people have fewer interactions with close
friends than less neurotic people, and interactions with close friends are more intimate 
than interactions with non-close others, the level-2 relationships between Neuroticism 
and mean Intimacy may reflect individual differences in the distribution of interactions. 
By entering Close Friend grand mean-centered at level 1, such a confound is corrected. 

At lower levels of a model, predictors can also be group mean-centered. When a 
predictor is group mean-centered, deviations are taken from the mean of a predictor for 
each group, and the intercept represents the expected outcome for an observation for 
which the predictor is the group mean of the predictor. Aside from rounding error, when 
predictors are entered group mean-centered, the intercept for each group is unchanged 
from an unconditional model. Conceptually, entering predictors group mean-centered is 
similar to conducting a regression analysis for each group and using the coefficients from 
these analyses as the dependent measures in another analysis, a procedure that is some- 
times referred to as two-stage regression. 

Similar to the recommendations for centering at level 2, there is a general sense 
that categorical predictors should be entered uncentered, in part because this facilitates 
interpretation of the intercept—the intercept is now the expected value when a predic- 
tor is 0. For continuous predictors at level 1, there is some disagreement about grand- 
versus group mean-centering. I (and many other analysts) favor group mean-centering 
continuous predictors. When predictors are group mean-centered, level-2 differences in 
predictors do not contribute to the results, and the analysis is conceptually equivalent 
to conducting separate regression equations for each group (level-2 unit), then analyz- 
ing these coefficients. For example, in a daily diary study in which daily Mood is being 
predicted by Stress, individual differences in mean daily stress would not influence the 
parameter estimates—in essence, they would be controlled. In contrast, if Stress were 
entered grand mean-centered, then estimates of the level-1 relationships between Mood 
and Stress would be influenced by level-2 differences in mean stress because daily mood 
would be modeled as a function of how much daily stress deviated from the grand mean 
of stress. The argument against group mean-centering of predictors is that this takes 
level-2 differences in predictors out of models, and does not represent the data properly. 
Regardless, the point remains that when level-1 predictors (either categorical or continu- 
ous) are grand mean-centered, the resulting intercepts are adjusted for level-2 differences 
in predictors. Whether such an adjustment is desirable will vary as a function of the ques- 
tions of interest and the data at hand. 
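
The two centering options can be illustrated with a small sketch. The Stress scores and person labels below are hypothetical, and the code only constructs the centered predictors; it does not fit a multilevel model.

```python
from statistics import mean

# Hypothetical daily Stress scores: days (level 1) nested within persons (level 2).
stress = {
    "p1": [1.0, 2.0, 3.0],
    "p2": [4.0, 5.0, 6.0],
    "p3": [7.0, 8.0, 9.0],
}

# Grand mean-centering: subtract the mean of all observations from each score.
grand_mean = mean([x for days in stress.values() for x in days])
grand_centered = {p: [x - grand_mean for x in days] for p, days in stress.items()}

# Group mean-centering: subtract each person's own mean from his or her scores.
group_centered = {p: [x - mean(days) for x in days] for p, days in stress.items()}

# Person-level means of the centered predictors.
person_means_grand = {p: mean(v) for p, v in grand_centered.items()}
person_means_group = {p: mean(v) for p, v in group_centered.items()}
```

After group mean-centering, every person's mean on the predictor is 0, so level-2 differences in mean Stress are removed from the predictor. After grand mean-centering, those differences remain (here, the person means are -3, 0, and 3).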


Comparing Fixed Effects 


This section introduces what I think is one of the most powerful but underutilized aspects 
of MLM—the ability to compare coefficients through the use of constraints on a model. 
Through the judicious use of coding, centering, and adding a nested level representing
a measurement model, analysts can specify models in which coefficients represent very 
specific entities (means or relationships), and these entities can in turn be compared, and 
very precise conclusions can be drawn. 

A constraint on a model consists of applying weights to a group of coefficients (i.e., 
constraining them in some way), and the impact of this constraint on the fit of the model 
to the data is examined. If constraining the coefficients as represented by the weights 
results in a poorer fit of the model, then the comparison represented by the weights is 
assumed to be significant. The value of the constraint is the sum of the products of each 
coefficient and the weight assigned in the contrast to each coefficient. Note that the significance
test is always a test against 0, specifically, a $\chi^2$ test.

Assume a study of daily mood and daily stress, and the question of interest concerns 
differences in the strength of the mood-job stress and mood-family stress relationships. 
A model of such analyses is below: 


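$y_{ij} = \beta_{0j} + \beta_{1j}(\text{Job}) + \beta_{2j}(\text{Family}) + r_{ij}$

$\beta_{0j} = \gamma_{00} + u_{0j}$

$\beta_{1j} = \gamma_{10} + u_{1j}$

$\beta_{2j} = \gamma_{20} + u_{2j}$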


The mood-job stress and mood-family stress relationships are represented by the $\gamma_{10}$
and $\gamma_{20}$ coefficients, respectively, and the strength of these two coefficients can be compared
by constraining them to be equal. In terms of weights, a simple weighting of 1 and -1 (in
this case, it does not matter to which coefficient the weights are applied) tests the hypothesis
that the two coefficients are the same. For example, if both slopes were -.25, then the
constraint would sum to 0: 1*(-.25) + -1*(-.25) = 0, and imposing the constraint would not
change the model fit. In contrast, if one of the slopes was -.50 and the other was -.25,
then the constraint would not sum to 0, and it might be significantly different from 0.
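
The arithmetic of a constraint can be sketched in a few lines. The function names and the covariance matrix of the two slope estimates are hypothetical. The Wald statistic shown (squared constraint value divided by its sampling variance) is one common way of obtaining a 1 df $\chi^2$ test; whether a particular program computes the test in exactly this way is an assumption here.

```python
def constraint_value(coefs, weights):
    """Sum of the products of each coefficient and its weight."""
    return sum(w * c for w, c in zip(weights, coefs))


def wald_chi_square_1df(coefs, weights, cov):
    """Squared constraint value divided by its sampling variance (w'Vw)."""
    value = constraint_value(coefs, weights)
    n = len(weights)
    variance = sum(weights[i] * cov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    return value ** 2 / variance


weights = [1, -1]
equal_slopes = [-0.25, -0.25]
unequal_slopes = [-0.50, -0.25]

# Hypothetical sampling variances and covariance of the two slope estimates.
cov = [[0.010, 0.002],
       [0.002, 0.010]]

value_equal = constraint_value(equal_slopes, weights)      # slopes identical
value_unequal = constraint_value(unequal_slopes, weights)  # slopes differ
chi2_unequal = wald_chi_square_1df(unequal_slopes, weights, cov)
```

With identical slopes the constraint value is 0; with slopes of -.50 and -.25 it is -.25, and its significance depends on how precisely the two slopes are estimated.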

Constraints can involve any number of coefficients. For example, if there were three 
level-1 predictors, the first two could be compared to the third with weights of 1, 1, and 
-2. In addition, constraints can involve multiple comparisons simultaneously. With three 
predictors at level 1, a single constraint could consist of two comparisons, such as 1, -1, 0
and 1, 0, -1. Testing multiple comparisons simultaneously helps to control Type I errors,
although, as with an analysis of variance (ANOVA) with multiple groups, a significant
constraint with multiple comparisons does not indicate exactly
which coefficients differ from each other. Note that the degrees of freedom (df) of the $\chi^2$
test of the constraint correspond to the number of comparisons in the constraint. When
there is one comparison, the test has 1 df; when there are two comparisons, it has 2 df, 
and so forth. 
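
A constraint made up of two simultaneous comparisons can be sketched as a 2 df Wald test. This is a minimal illustration under stated assumptions: the coefficients and their covariance matrix are hypothetical, the function handles exactly two comparisons, and the Wald form itself is an assumption about how such a test may be computed rather than a description of any specific program.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def two_comparison_wald(coefs, contrasts, cov):
    """Wald chi-square for exactly two simultaneous comparisons (df = 2)."""
    if len(contrasts) != 2:
        raise ValueError("this sketch handles exactly two comparisons")
    d = mat_vec(contrasts, coefs)                   # value of each comparison
    cv = [mat_vec(cov, row) for row in contrasts]   # V c_k for each comparison
    # s[i][j] = c_i' V c_j: sampling covariance of the two comparison values.
    s = [[sum(a * b for a, b in zip(contrasts[i], cv[j])) for j in range(2)]
         for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    chi2 = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return chi2, len(contrasts)


coefs = [-0.50, -0.25, -0.25]            # three hypothetical level-1 slopes
contrasts = [[1, -1, 0], [1, 0, -1]]     # two comparisons in one constraint
cov = [[0.01, 0.0, 0.0],
       [0.0, 0.01, 0.0],
       [0.0, 0.0, 0.01]]                 # hypothetical, uncorrelated estimates

chi2, df = two_comparison_wald(coefs, contrasts, cov)
```

The returned df equals the number of comparisons, which matches the rule described above.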

The previous examples describe ways of determining whether coefficients (or com- 
binations of coefficients) differ from each other per se, which is likely to be the focus of 
most, but not all, questions. For example, a researcher may be interested in the relative 
strength of two coefficients, irrespective of their sign (i.e., a comparison of the absolute 
value). This could be the case in a study of relationships between a daily outcome, such 
as self-esteem, and two other daily measures, such as positive events and negative events, 
as in the level-1 model below: 


$y_{ij} = \beta_{0j} + \beta_{1j}(\text{Positive}) + \beta_{2j}(\text{Negative}) + r_{ij}$


For the sake of argument, assume that the coefficient for positive events is +.25, and
the coefficient for negative events is -.25. Using the “standard” weights of 1 and -1 described
previously, these two coefficients might be significantly different from each other (1*.25 +
-1*-.25 = .50). Nevertheless, it is quite clear that their absolute values are not different;
they are identical. To compare the absolute values, weights of 1 and 1 can be used. Note
that when the coefficients are of equal magnitude, +.25 and -.25, the sum of the products
is 0 (1*.25 + 1*-.25 = 0). If the coefficients were of different magnitude (e.g., +.25 and
-.50), the constraint might be significant because it would not be 0 (1*.25 + 1*-.50 =
-.25). For an application of this technique, see Nezlek and Plesko (2001).
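
The same arithmetic can be checked with a short sketch (the function name is hypothetical). Note that weights of 1 and 1 compare absolute values only when the two coefficients have opposite signs, as they do in this example.

```python
def constraint_value(coefs, weights):
    """Sum of the products of each coefficient and its weight."""
    return sum(w * c for w, c in zip(weights, coefs))


positive, negative = 0.25, -0.25

# "Standard" weights: tests whether the signed coefficients differ.
signed_difference = constraint_value([positive, negative], [1, -1])

# Weights of 1 and 1: tests whether the magnitudes differ (opposite signs).
magnitude_difference = constraint_value([positive, negative], [1, 1])

# Coefficients of unequal magnitude, +.25 and -.50.
unequal_magnitude = constraint_value([0.25, -0.50], [1, 1])
```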


Coding 


For present purposes, I consider two types of codes, dummy- and contrast-codes. Dummy- 
codes occur when a measure is represented by a 1 (usually indicating the presence or exis- 
tence of something) and a 0 (indicating the lack of something). Contrast-codes consist of 
some type of comparison, with the limitation that the values that make up the contrast 
sum to 0. To compare two entities, a contrast code of 1 and -1 could be used. For three 
entities, a code of 1, 1, and -2 could be used. The weights in a contrast-code need to sum 
to 0 because in MLM, significance tests test coefficients against 0. If the coefficient for a 
contrast is significantly different from 0, then the differences represented by that contrast 
are significant. 

Contrast- and dummy-codes provide different advantages, and I discuss these advan- 
tages within the context of a social interaction diary study in which interactions are 
nested within persons. The hypotheses concern differences in how intimate people find 
interactions that involve and do not involve a close friend, and interactions are classified 
as a function of whether a close friend was present or not. 

This could be done using a contrast variable at the interaction level (level 1), and 
modeling intimacy as a function of this contrast variable. 


$y_{ij} = \beta_{0j} + \beta_{1j}(\text{Close Friend Contrast}) + r_{ij}$

If the contrast variable was entered as uncentered, the resulting intercept would rep- 
resent the intimacy for interactions that were “neutral,” and the slope would represent 
the difference. In turn, these coefficients could be brought up to level 2 and modeled as a 
function of an individual-difference measure, such as Neuroticism. 


$\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Neuroticism}) + u_{0j}$

$\beta_{1j} = \gamma_{10} + \gamma_{11}(\text{Neuroticism}) + u_{1j}$


What is critical about this model is that the level-1 slope represents the difference 
in intimacy between interactions with friends present and interactions with no friends 
present, and accordingly, when this slope is modeled at level 2, individual differences in 
a difference score are being analyzed. See Schaafsma, Nezlek, Krejtz, and Safron (2010) 
for an application of such a model in which the level-1 model compares the interactions
of ethnic/minority members that involved or did not involve a member of their ethnic 
group. 
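
What an uncentered contrast code estimates can be seen in a sketch for a single hypothetical person. Ordinary least squares is used here only for illustration; in MLM the coefficients are estimated for all persons simultaneously, not by separate regressions. With codes of 1 and -1, the intercept is the midpoint of the two category means and the slope is half the difference between them (codes of .5 and -.5 would make the slope equal the full difference).

```python
from statistics import mean

# Hypothetical intimacy ratings for one person's interactions.
intimacy = [6.0, 7.0, 8.0, 3.0, 4.0, 5.0, 4.0]
close_friend = [True, True, True, False, False, False, False]

# Contrast code: 1 = close friend present, -1 = no close friend present.
contrast = [1 if present else -1 for present in close_friend]

# Simple regression of intimacy on the uncentered contrast code.
mx, my = mean(contrast), mean(intimacy)
sxy = sum((x - mx) * (y - my) for x, y in zip(contrast, intimacy))
sxx = sum((x - mx) ** 2 for x in contrast)
slope = sxy / sxx
intercept = my - slope * mx

# Category means, for comparison with the regression coefficients.
mean_friend = mean([y for y, f in zip(intimacy, close_friend) if f])
mean_no_friend = mean([y for y, f in zip(intimacy, close_friend) if not f])
```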

Although estimating such a difference score can serve many purposes, it cannot be 
used to test some hypotheses. Continuing the present example, a researcher might be 
interested in knowing whether the relationship between Neuroticism and Intimacy with 
friends is the same as the relationship between Neuroticism and Intimacy with non- 
friends. What is needed to test such a hypothesis are separate estimates of the intimacy 
people find with friends and without them, and a difference score does not provide such 
estimates. 

Providing separate estimates for level 1 requires using a set of dummy-codes in what 
is sometimes referred to as a no-intercept or zero-intercept model. In such models, the 
intercept is dropped, and the outcome is modeled as a function of a series of dummy- 
codes (one representing each category of a classificatory system). To continue the exam- 
ple, a model such as that below could be run. 


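$y_{ij} = \beta_{1j}(\text{Friend}) + \beta_{2j}(\text{NoFriend}) + r_{ij}$

$\beta_{1j} = \gamma_{10} + \gamma_{11}(\text{Neuroticism}) + u_{1j}$

$\beta_{2j} = \gamma_{20} + \gamma_{21}(\text{Neuroticism}) + u_{2j}$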


In the level-1 model, Friend is a dummy-coded variable indicating whether a close 
friend was present, and NoFriend is the corresponding variable indicating whether no 
close friend was present. Understanding how such a model estimates means for these 
two types of interactions involves estimating expected values. For interactions in which 
a friend was present, Friend = 1, and NoFriend = 0, and the expected outcome is the $\beta_{1j}$
coefficient, because the NoFriend coefficient drops out. In contrast, for interactions in
which a friend was not present, Friend = 0, and NoFriend = 1, and the expected outcome
is the $\beta_{2j}$ coefficient because the Friend coefficient drops out. Determining whether the
relationship between Neuroticism and Intimacy with friends is the same as the relationship
between Neuroticism and Intimacy with nonfriends is done at level 2 by constraining
the $\gamma_{11}$ and $\gamma_{21}$ coefficients to be equal.
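
The way a zero-intercept model yields separate means can be sketched for one hypothetical person. The data are invented, and the per-person least-squares solution is a simplification used only to show the logic; MLM estimates these coefficients across all persons at once.

```python
# Hypothetical intimacy ratings for one person's interactions.
intimacy = [6.0, 7.0, 8.0, 3.0, 4.0, 5.0, 4.0]
friend = [1, 1, 1, 0, 0, 0, 0]            # 1 = close friend present
no_friend = [1 - f for f in friend]       # 1 = no close friend present

# Cross-products for y = b1*Friend + b2*NoFriend (no intercept).
ff = sum(f * f for f in friend)
nn = sum(n * n for n in no_friend)
fn = sum(f * n for f, n in zip(friend, no_friend))   # 0: codes never overlap
fy = sum(f * y for f, y in zip(friend, intimacy))
ny = sum(n * y for n, y in zip(no_friend, intimacy))

# Solve the 2 x 2 normal equations by Cramer's rule.
det = ff * nn - fn * fn
b1 = (fy * nn - fn * ny) / det   # expected intimacy when a friend is present
b2 = (ff * ny - fn * fy) / det   # expected intimacy when no friend is present
```

Because the two codes are never 1 for the same interaction, each coefficient is simply the mean for its own category.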

This type of dummy-coded analysis can be used at any level of a model, with any 
number of categories. Each category needs to be represented with its own code, and two 
categorical systems can be combined to provide the basis for more complex comparisons. 
Assume a study in which participants are men and women who either received some type 
of treatment or not, and the dependent measure is daily mood. If sex and treatment are 
combined to create a four-category system, any combination of groups can be compared 
to any other using tests of fixed effects. The level-2 predictors are four dummy-codes, 
each representing a combination of sex and treatment (yes/no). The model is below. 


$y_{ij} = \beta_{0j} + r_{ij}$

$\beta_{0j} = \gamma_{01}(\text{Men-Yes}) + \gamma_{02}(\text{Men-No}) + \gamma_{03}(\text{Women-Yes}) + \gamma_{04}(\text{Women-No}) + u_{0j}$


In summary, to conduct these types of analyses, the following conditions need to be
met:

1. All observations need to be classified unambiguously into a single category.
There can be no overlap or dual classification, and every observation must be
represented.

2. All categories (dummy-codes) need to be included in the model, and the intercept 
needs to be dropped. 
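
The two conditions can be checked mechanically when the codes are built. The sketch below assumes the four-category sex-by-treatment example; the function name, participant list, and comparison weights are hypothetical.

```python
CATEGORIES = ("Men-Yes", "Men-No", "Women-Yes", "Women-No")


def dummy_codes(sex, treated):
    """Return one 0/1 code per category for a single participant."""
    label = f"{sex}-{'Yes' if treated else 'No'}"
    if label not in CATEGORIES:
        raise ValueError(f"unclassifiable participant: {label}")
    return {category: int(category == label) for category in CATEGORIES}


# Hypothetical participants: (sex, received treatment).
participants = [("Men", True), ("Men", False), ("Women", True),
                ("Women", False), ("Women", True)]
coded = [dummy_codes(sex, treated) for sex, treated in participants]

# Condition 1: each participant falls into exactly one category.
exactly_one = all(sum(row.values()) == 1 for row in coded)

# Condition 2: all four codes are present, so the intercept can be dropped.
all_codes_present = all(set(row) == set(CATEGORIES) for row in coded)

# Example comparison: treated vs. untreated, collapsing across sex.
treatment_weights = {"Men-Yes": 1, "Men-No": -1, "Women-Yes": 1, "Women-No": -1}
```

Any comparison among the four groups can then be expressed as a set of weights on the four coefficients, as with `treatment_weights`.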


Interactions, Moderation, and Mediation 


In discussing moderation and mediation, I rely on the classic distinction discussed by 
Baron and Kenny (1986). Although their analytic framework has been criticized (and 
various alternatives have been proposed), I am more concerned with the conceptual issues
they raised. Moderation exists when the relationship between two variables varies as a 
function of a third variable, whereas mediation is said to occur when a third variable 
explains the relationship between two others. Within the multilevel context, moderation 
is typically examined by some sort of interaction, and cross-level interactions are some- 
times referred to as moderating relationships. 

Interactions (and, by extension, moderation) can also be examined within levels of 
analysis. For guidance about how to do this, consult Aiken and West (1991). The same 
techniques can be used to examine within-level interactions at any level of analysis, albeit 
with some differences reflecting the need to group-center variables at lower levels of 
analysis. An example of a within-level interaction is presented in the section below on 
conceptualizing the multilevel structure. The models described there can be used to deter- 
mine whether a level-1 slope (e.g., a mood-stress relationship) varies as a function of (is 
moderated by) a categorical level-1 variable (e.g., weekday vs. weekend). 

It is also possible to examine within-level interactions between two continuous vari- 
ables, as in Nezlek and Plesko (2003). In this study, measures of self-esteem and positive 
and negative events were collected each day. The analyses examined a buffering effect— 
the possibility that relationships between self-esteem and negative events would be weaker 
on days when more positive events had occurred than on days when fewer positive events 
had occurred. The interaction term was created by centering each event score within 
each individual (subtracting the mean for each person from each of his or her observa- 
tions), then multiplying these centered scores. The resulting interaction term was entered 
as uncentered (because the measures on which it was based had been centered), and the 
two event scores (positive and negative) were entered group mean-centered. The resulting 
interaction term was significant, and estimated values for self-esteem on days that were 
± 1 SD on positive and negative events (based on the within-person SD for events) indicated
that positive events did buffer the effect of negative events on self-esteem.
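
Construction of such an interaction term can be sketched as follows. The event scores are hypothetical, and the code only builds the predictors (centered scores and their product); it does not fit the model.

```python
from statistics import mean

# Hypothetical daily event scores for two people.
days = {
    "p1": {"positive": [1.0, 2.0, 3.0], "negative": [2.0, 0.0, 1.0]},
    "p2": {"positive": [2.0, 2.0, 5.0], "negative": [3.0, 1.0, 2.0]},
}


def center_within_person(scores):
    """Subtract a person's own mean from each of his or her observations."""
    person_mean = mean(scores)
    return [x - person_mean for x in scores]


centered = {}
interaction = {}
for person, events in days.items():
    pos_c = center_within_person(events["positive"])
    neg_c = center_within_person(events["negative"])
    centered[person] = {"positive": pos_c, "negative": neg_c}
    # Product of the centered scores; entered uncentered in the model.
    interaction[person] = [p * n for p, n in zip(pos_c, neg_c)]
```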

Evaluating mediation within the multilevel context (particularly at level 1) is much 
more complex, and techniques to do this are being developed. Regardless of the specific 
method or focus, I recommend caution when evaluating mediation on the basis of changes 
in variance estimates. As noted in the section on effect sizes below, error variances in the 
multilevel context are not interchangeable with residual variance estimates within the 
OLS context. Moreover, when considering how a series of level-2 measures might medi- 
ate each other as predictors of a level-1 coefficient, it must be kept in mind that variance 
estimates (no matter how flawed they might be) can be used only when a coefficient is 
modeled as randomly varying. For an informed discussion of mediation within the multilevel
context, see Bauer, Preacher, and Gil (2006); see Card (Chapter 26, this volume) and
Eid, Courvoisier, and Lischetzke (Chapter 21, this volume) for discussions of evaluating
mediation using structural equation modeling (SEM).

At level 2 (or at the highest level of a model), I think it is safe to follow the guidelines 
originally offered by Baron and Kenny (1986). A variable is said to mediate the relation- 
ship between two others if it is related to the predictor, and if, when added to the model, 
the original predictor becomes nonsignificant and the mediator is significant. This leaves open
the question of partial mediation, the evaluation of which relies upon changes in variance
estimates that may be problematic. See Bauer, Preacher, and Gil (2006) for a discussion.

The most complex situation is lower-level mediation, in which a level-1 variable 
mediates the relationship between two other level-1 variables. The possibility of moder- 
ated mediation (i.e., that mediated relationships themselves might vary as a function of 
level-2 variables) also needs to be taken into account. The most thorough treatment of 
this I have encountered is that provided by Bauer and colleagues (2006). Space does not 
permit a detailed description of their strategy, but they devised a way to estimate the 
direct and indirect effects of a mediator on an outcome, taking into account the possibil- 
ity that these effects might vary across units of analysis. Their technique is a bit complex, 
and it may take those who are not experienced modelers a bit of time to understand how 
to apply it to their own data, but it appears at present to be the best available method.


Conceptualizing the Multilevel Structure: 
What Should Be Nested within What and When? 


In most instances, the multilevel structure of a dataset can be determined straightfor- 
wardly (e.g., in a diary-style study, days or interactions might be treated as nested within 
persons). Nevertheless, there may be instances in which the structure is not clear, or when 
different ways of structuring the data seem reasonable. For example, should a study in 
which multiple observations are collected each day (e.g., a beeper study) be conceptual- 
ized as a three-level model (observations within days within persons) or as a two-level 
model (observations within persons, ignoring days)? 

As discussed previously, within a multilevel data structure, a level of analysis rep- 
resents a sample of some kind, and when considering whether a level of analysis should 
be included in a model, the extent to which a sample would be adequately represented 
by that level needs to be taken into account. For example, if multiple observations were 
collected each day for a sample of people, but only 2 days of data were collected for 
each person, nesting observations within days and days within persons would mean that 
there would be only two units of analysis at the day level (only 2 days for each person). 
Two days do not provide a basis for estimating the day-level variance, so in such a case, 
the day-level variance could not be modeled. Moreover, unless an analyst was willing to 
assign some meaning to the fact that 1 day of data collection was the first day and the 
other day was the second (treating the days as fixed effects), there would be no basis on 
which the data from different people could be organized together in terms of the day on 
which data were collected. In such a case, observations would simply be nested within 
persons. 

Such an example raises the issue of how many units are needed to constitute a level 
of analysis. Unfortunately, to my knowledge, there are no clear or firm guidelines for 
this. Given this, I recommend thinking of each level of analysis as representing a sample, 
then thinking of how many observations are needed to provide a basis for making an
inference to a population. Certainly two observations are too few, and 15 are probably 
adequate for most purposes. The exact number will depend on the parameters of interest. 
Because means are more reliable than covariances (slopes), estimating means reliably will 
take fewer observations than estimating slopes reliably. Regardless, when there are not 
enough units to constitute a level of analysis, but there are enough that an analyst does 
not want to ignore the matter completely (e.g., 5 days in a situation such as the example 
just discussed), the level can be included in the model with the understanding that it will 
not be possible to estimate reliably the variance at that level of analysis. 

A similar but slightly different situation arises when units of analysis can be distin- 
guished or classified in terms of a fixed number. For example, assume a study in which 
observations are collected each day, and hypotheses concern differences between weekdays
and weekends. Should such data be analyzed with a three-level model, days nested
within type of day within persons? Not really. First, there is the fact that the “middle” level
of analysis would have only two possible units (a day can occur only during the week 
or on a weekend) and, as discussed earlier, two does not provide a basis for drawing an 
inference to a population. But more important, the categories of weekday and weekend 
are not drawn from a universe of types of days—there are only two possibilities. 

In such cases, type of day should be treated as a day-level variable, and differences 
between weekends and weekdays can be examined using two dummy-codes, one repre- 
senting each type of day (see the previous section on coding). To compare day-level means 
the following model could be run. Wday is a dummy-coded variable indicating whether
a day is a weekday or not, and Wend is the corresponding variable for weekend days. Note
that the level-1 model has no intercept, and predictors are entered uncentered. The difference
between means for the two types of days is tested by constraining the $\gamma_{10}$ and $\gamma_{20}$
coefficients to be equal.


$y_{ij} = \beta_{1j}(\text{Wday}) + \beta_{2j}(\text{Wend}) + r_{ij}$

$\beta_{1j} = \gamma_{10} + u_{1j}$

$\beta_{2j} = \gamma_{20} + u_{2j}$


Differences between weekends and weekdays in the relationships between a day-level 
predictor and an outcome can be examined with an extension of this model. The first 
step in this analysis is to multiply Stress by the dummy-codes for weekdays and week- 
ends. All four measures are entered into a zero-intercept level-1 model. The dummy-codes 
themselves are entered as uncentered, and the two stress measures are entered group 
mean-centered. It is also possible to group center the stress measures before multiplying 
them, then enter the products uncentered. Regardless, the difference between the Stress
slopes for the two types of days is tested by constraining the $\gamma_{20}$ and $\gamma_{40}$ coefficients to be
equal.


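$y_{ij} = \beta_{1j}(\text{Wday}) + \beta_{2j}(\text{Wday*Stress}) + \beta_{3j}(\text{Wend}) + \beta_{4j}(\text{Wend*Stress}) + r_{ij}$

$\beta_{1j} = \gamma_{10} + u_{1j}$

$\beta_{2j} = \gamma_{20} + u_{2j}$

$\beta_{3j} = \gamma_{30} + u_{3j}$

$\beta_{4j} = \gamma_{40} + u_{4j}$
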
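The four level-1 predictors for this model can be built as in the following sketch, which uses the second option described above: Stress is centered on the person's mean first, and the products are then entered uncentered. The diary records are hypothetical.

```python
from statistics import mean

# Hypothetical diary records for one person: (is_weekend, stress).
records = [(False, 2.0), (False, 4.0), (True, 1.0), (False, 3.0), (True, 5.0)]

# Group (person) mean of Stress across all of this person's days.
person_mean = mean([stress for _, stress in records])

rows = []
for is_weekend, stress in records:
    wend = int(is_weekend)
    wday = 1 - wend
    stress_c = stress - person_mean      # group mean-centered Stress
    rows.append({
        "Wday": wday,
        "Wday*Stress": wday * stress_c,  # Stress slope for weekdays
        "Wend": wend,
        "Wend*Stress": wend * stress_c,  # Stress slope for weekend days
    })
```

Each day contributes to only one dummy-code and one product term, so the model estimates a separate mean and a separate Stress slope for each type of day.
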
The same logic holds for other levels of analysis. Assume our typical study of days
nested within persons, and hypotheses concern differences between people, such as sex 
differences or differences as a function of some experimental manipulation. It would 
not be appropriate to add a third level of analysis and nest persons within sex or within 
an experimental condition. Such variables are best represented as person-level variables, 
and sex differences and differences between conditions can be represented using dummy- 
codes in a fashion that is structurally similar to the level-1 example just presented. 


Standardizing Measures 


In comparison to most OLS analyses, MLM analyses provide only unstandardized esti- 
mates of coefficients. Nevertheless, many analysts (and reviewers and editors) want (or 
expect) a description of relationships that are standardized in some way. Although some 
have proposed ways of standardizing MLM coefficients using estimates provided in 
MLM analyses (e.g., dividing coefficients by variance estimates), my sense is that such 
techniques may not be justified statistically. At present, I recommend standardizing mea- 
sures in advance and using the unstandardized coefficients provided by the analyses. 

The most straightforward situation is standardizing measures at the highest level 
of analysis (e.g., level 2 in a two-level model). For many ambulatory assessment studies, 
this will be the person. So, if days were nested within persons (or if observations were 
nested within days that were nested within persons, as in a three-level model), individual 
differences such as personality or physiological measures could be standardized across 
persons. In turn, the unstandardized coefficients produced by the MLM analyses (at 
this level of analysis) would then be functionally standardized (i.e., a coefficient
would represent the change in the outcome associated with a 1 SD change in the predictor). Moreover,
standardizing measures in this way makes it easier to interpret comparisons of the strength
of coefficients, as discussed in the section on comparing fixed effects. Of course, if such 
measures were not distributed normally, they might need to be transformed and then 
standardized. Analysts need to make such decisions based on the measures at hand and 
the norms within their disciplines. 

Decisions about standardizing at lower levels of analysis (e.g., level 1 in a two-level 
model or levels 1 and 2 in a three-level model) are a bit more complicated. One option 
is to standardize across the entire population of observations. For example, in a diary 
study in which observations were nested within days, a day-level measure such as daily 
mood could be standardized in terms of all days, ignoring the nested structure of the 
data. Moreover, standardizing all of the day-level measures in such a study would have 
the advantage of equating the total variance of different measures, which may make the 
interpretation of coefficients easier. 

Nevertheless, it is critical to keep in mind that such population standardization at 
level 1 does not produce standardized coefficients in the same sense that standardizing at 
level 2 does; that is, level-1 slopes from such data do not represent the change in an outcome
associated with a 1 SD increase in a level-1 predictor. This is because standardizing in this
way sets the total variance (the sum of the between and within variances) to 1.0, and it 
does not make all the level-1 variances the same for all level-2 units. In a diary study, it 
would not set to 1.0 the within-person standard deviation for each participant. Within- 
person standard deviations would still vary. It is possible to standardize measures within 
each level-2 unit of analysis (e.g., each person), but such standardization is not recommended,
because it may eliminate important aspects of the data from the model.
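
The point that population standardization does not equate within-person standard deviations can be verified with a small sketch. The mood scores are hypothetical, and sample standard deviations are used throughout.

```python
from statistics import mean, stdev

# Hypothetical daily mood scores for two people.
mood = {"p1": [4.0, 5.0, 6.0], "p2": [1.0, 5.0, 9.0]}

# Standardize across all days, ignoring the nested structure of the data.
all_days = [x for days in mood.values() for x in days]
m, s = mean(all_days), stdev(all_days)
z = {p: [(x - m) / s for x in days] for p, days in mood.items()}

# The total SD of the standardized scores is 1 (within rounding error) ...
total_sd = stdev([x for days in z.values() for x in days])

# ... but the within-person SDs still differ from person to person.
within_sd = {p: stdev(days) for p, days in z.items()}
```

In this example the second person's within-person SD remains four times the first person's, exactly as in the raw data.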

For analysts who want to express their results in terms of some sort of standard 
metric, I recommend generating predicted values for units of observation ± 1 SD from the
mean. For level-1 measures, the within-person standard deviation can be estimated by
taking the square root of the level-1 variance from an unconditional model (whether the
measure has been population standardized or not), and then multiplying scores representing
± 1 SD by the coefficients corresponding to each measure.
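
As a rough sketch of this calculation, assume a model in which daily Mood is predicted by group mean-centered Stress. All numerical values below are hypothetical and stand in for estimates that would come from actual analyses.

```python
import math

# Hypothetical estimates.
level1_variance = 0.64   # level-1 variance of Stress, unconditional model
intercept = 5.0          # mean Mood on a day of average (person-mean) Stress
slope = -0.50            # mean Mood-Stress slope

# Within-person SD of Stress: square root of the level-1 variance.
within_sd = math.sqrt(level1_variance)

# Predicted Mood on days 1 SD above and 1 SD below a person's mean Stress.
predicted_high = intercept + slope * within_sd
predicted_low = intercept - slope * within_sd
```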


Understanding Effect Sizes within the Multilevel Context 


For analysts accustomed to estimating effect sizes within the OLS context by relying on 
variance estimates, it may be difficult to understand the difficulties inherent in using vari- 
ances to estimate effect sizes within the multilevel context. Moreover, different research- 
ers have approached the subject differently, adding to the confusion. My recommendation 
is conservative and reflects the advice of Kreft and de Leeuw (two well-respected model- 
ers), who noted, “In general, we suggest not setting too much store by the calculation of 
$R_2^2$ [level-2 variance] or $R_1^2$ [level-1 variance]” (1998, p. 119).

Within the multilevel context, the standard method (to the extent that a standard 
exists) of estimating effect sizes is similar to that used within the OLS framework. The 
residual variances from models with different predictors are compared: The residual
variance from a model with added predictors is subtracted from the residual variance
of the “base” (or previous) model, and this difference is then divided by the variance in
the base model. The resulting quotient is the proportion of residual variance that has been
“explained” by the added predictors. Moreover, such a procedure can be applied to both 
level-1 and level-2 variance estimates. 
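
The calculation itself is simple, as the following sketch shows. The function name and variance estimates are hypothetical. Nothing in the arithmetic prevents the result from being zero or negative when an added predictor fails to reduce the residual variance.

```python
def variance_explained(base_variance, model_variance):
    """Proportional reduction in residual variance relative to the base model."""
    return (base_variance - model_variance) / base_variance


# Hypothetical level-1 residual variances: base model and model with a predictor.
reduction = variance_explained(base_variance=0.80, model_variance=0.60)

# A predictor that leaves the residual variance slightly larger than before.
negative_reduction = variance_explained(base_variance=0.80, model_variance=0.84)
```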

At first glance, this sounds all well and good—within the multilevel context one 
follows the same procedures as in the single-level (OLS) context. Unfortunately, in terms 
of simplicity, there are some important caveats that need to be discussed. First, keep in 
mind that for MLM, it is possible to add a significant level-1 predictor to a model that 
does not decrease the level-1 residual variance—something that cannot happen in OLS 
analyses. In contrast to OLS procedures in which the significance tests of coefficients are 
based on variance estimates, in MLM, variances are estimated separately from tests of 
fixed effects. Such a possibility calls into question the accuracy of using variance reduc- 
tions as a means of estimating effect sizes. Second, it is not possible to use variances to 
estimate effect sizes when predicting fixed effects. For example, if a slope is not mod- 
eled as random, then there is no residual variance for that slope, making it impossible 
to calculate a reduction in variance for that slope. Third, when estimating effect sizes at 
level 1, predictors should be entered as group mean-centered. If predictors are not group 
mean-centered, then the level-1 variance estimates will contain level-2 variance of those 
predictors, undermining the validity of relying upon a reduction in level-1 variance (Kreft 
& de Leeuw, 1998, p. 119). 

At this point in time, my sense is that this issue is far from resolved, and I urge 
analysts to be cautious when using reductions in variances to estimate effect sizes. It 
seems that more caution is necessary at level 1 than at level 2. More specifically, using 
reductions in variances to estimate effect sizes may be unreliable when level-1 predictors 
are not entered group mean-centered and when there are multiple level-1 predictors. At
times, editors and reviewers may (blindly) insist on including effect sizes based on vari- 
ance estimates, irrespective of the possible problems inherent in such a procedure. In such 
cases, authors have little choice but to provide such estimates, although they may want to 
discuss in passing the possible problems inherent in such a procedure. 

Depending on the circumstances, analysts may be able to describe the strength of 
the relationships they have found using other procedures. For example, the magnitude 
of coefficients can be evaluated in terms of the standard deviation of the outcome: How 
many standard deviations does a slope represent? Such a criterion can be applied to either 
level-1 or level-2 slopes, although when predicting level-1 slopes, it will be difficult to 
estimate the standard deviation of a slope that is not modeled as randomly varying. 
Regardless, it is important to keep in mind that the variance estimates produced in MLM 
analyses are not the same as those produced in OLS regression-style analyses. 


Estimating Reliability of Scales 
Administered on a Repeated Basis 


A defining characteristic of ambulatory assessment studies is the fact that measures are 
collected on repeated occasions from the same person, and such measures may consist of 
multi-item scales (e.g., a three-item measure of anxiety is administered every day). When 
this is the case, researchers may want to know how reliable a scale is; that is, how consis- 
tently does a set of items measure the same construct? Note that this does not address the 
issue of validity, which concerns the correspondence between the construct being mea- 
sured and the construct that is intended to be measured; rather, it concerns the multilevel 
equivalent of a Cronbach’s alpha. 

When estimating reliability within the multilevel framework, I find it helps to think
of the reliability of a set of items as an estimate of the average correlation between
pairs of those items. The reliability of a scale for which the items are highly
correlated with each other will tend to be higher than the reliability of a scale for which
the items are not so highly correlated with each other. Let’s consider the simple example
of a study in which a scale is administered once a day for multiple days.

With this in mind, it may be helpful to start with descriptions of how such reliabili- 
ties should not be estimated. First, the reliability of the intercept from an unconditional 
model of mean scale scores is not the item-level reliability of the items making up the 
scale. Within MLM, the reliability of an intercept indicates how consistent responses 
are within level-2 units—in this case, how consistent the scale scores (the means of the 
items) are within an individual. By definition and design, scale means ignore the within- 
occasion variance of items and ignore the multilevel equivalent of within-person correla- 
tions among the items. Consequently, the reliability of the intercept provides no informa- 
tion about the day-level consistency of responses to individual items. Second, it is wrong 
to calculate within-person means for a set of items, then calculate a Cronbach’s alpha 
using these mean scores at the person level. Such a procedure confounds within- and 
between-person variance, and provides an estimate of nothing. Finally, it is wrong to cal- 
culate a Cronbach’s alpha for each day of a study, then combine these alphas somehow. 
This assumes that Day 1 for Person 1 should be matched with Day 1 for Person 2, and so 
forth. Keeping in mind the idea that reliability conceptually resembles a correlation, such
a procedure would be similar to calculating a correlation between two variables for each
day of a study, then averaging those correlations to estimate the relationship between 
the two variables. Moreover, what happens when participants have different numbers of 
days, or when days cannot be matched? 

The proper way to estimate the item-level reliability of a set of items within the 
multilevel framework is to add to the model what is sometimes referred to as a measure- 
ment model. For the typical diary study, this is done by nesting items within occasions 
of measurement, then nesting occasions within persons. The model equations for such an 
analysis are below, and an example of this procedure can be found in Nezlek and Gable 
(2001). 


Item level (Level 1)      $y_{ijk} = \pi_{0jk} + e_{ijk}$
Day level (Level 2)       $\pi_{0jk} = \beta_{00k} + r_{0jk}$
Person level (Level 3)    $\beta_{00k} = \gamma_{000} + u_{00k}$


In this analysis, i items are nested within j days, which are nested within k persons. 
The item-level reliability of the scale is the reliability of the level-1 intercept ($\pi_{0jk}$). This is an
indication of how consistent the responses are within days (and within persons)—the 
multilevel equivalent of a Cronbach’s alpha at the item level. An example of the level-1 
data structure for such an analysis is presented below. In this example, data are presented 
for two persons (A and B), for 2 days (1 and 2), for a three-item scale. The column labeled 
“Resp” contains the response for a specific item. 


Person Day Item Resp 
A 1 1 3 
A 1 2 4 
A 1 3 4 
A 2 1 4 
A 2 2 3 
A 2 3 5 
B 1 1 2 
B 1 2 1 
B 1 3 3 
B 2 1 2 
B 2 2 2 
B 2 3 1 
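

Diary data are often stored with one row per person per day and one column per item, so they have to be restructured before a measurement level can be added. The following is a minimal sketch of that restructuring, not a procedure taken from any particular program. The field names (`person`, `day`, `item1` to `item3`) and the function `to_long` are hypothetical; the responses are the ones shown in the table above.

```python
# Hypothetical wide-format records: one row per person per day, one field per item.
wide_rows = [
    {"person": "A", "day": 1, "item1": 3, "item2": 4, "item3": 4},
    {"person": "A", "day": 2, "item1": 4, "item2": 3, "item3": 5},
    {"person": "B", "day": 1, "item1": 2, "item2": 1, "item3": 3},
    {"person": "B", "day": 2, "item1": 2, "item2": 2, "item3": 1},
]


def to_long(rows, item_fields):
    """Return one record per item response (the level-1 unit of the analysis)."""
    long_rows = []
    for row in rows:
        for item_number, field in enumerate(item_fields, start=1):
            long_rows.append(
                {
                    "person": row["person"],
                    "day": row["day"],
                    "item": item_number,
                    "resp": row[field],
                }
            )
    return long_rows


long_rows = to_long(wide_rows, ["item1", "item2", "item3"])
for record in long_rows:
    print(record["person"], record["day"], record["item"], record["resp"])
```

Each person-day contributes three level-1 records, so the four person-days above yield 12 records. Reverse-scored items would need to be recoded before this step.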


There are a few caveats for such analyses. First, just as is required in a standard 
(single-level) reliability analysis, all items need to be scored “in the same direction.” Items 
that are meant to be reverse-scored before computing scale scores need to be reversed 
before estimating the reliability. Second, when a study has multiple scales, I recommend 
estimating the reliability of each scale separately. When multiple scales are analyzed
together using this procedure, the reliability of each scale is influenced by the reliability
and number of items of the other scales in the analysis. A more detailed explanation of this
is beyond the scope of this chapter, but it can be found in Nezlek (2010), and another 
approach to estimating reliability (based on SEM) can be found in Shrout and Lane 
(Chapter 17, this volume). The approach Shrout and Lane advocate assumes that occa- 
sions of measurement are fixed (rather than random), an assumption that may hold under 
certain circumstances. 
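
To make the link between the measurement model and a reliability coefficient concrete, the sketch below assumes the usual definition of the reliability of an intercept in multilevel software: the ratio of the true (occasion-level) variance of the intercept to the total variance of its estimate, where the total includes the item-level variance divided by the number of items. That formula is an assumption on my part about how a given program computes the coefficient, and the variance components are made-up numbers, not results from any study. The function name `intercept_reliability` is hypothetical.

```python
def intercept_reliability(tau_occasion, sigma2_item, n_items):
    """Ratio of true occasion-level variance to the total variance of an occasion mean.

    tau_occasion: variance of the level-1 intercepts across occasions (level-2 variance)
    sigma2_item:  item-level (level-1) residual variance
    n_items:      number of items answered on each occasion
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return tau_occasion / (tau_occasion + sigma2_item / n_items)


# Made-up variance components for a three-item scale (illustration only).
tau = 0.60
sigma2 = 0.90
reliability = intercept_reliability(tau, sigma2, n_items=3)
print(round(reliability, 3))
```

Under this definition, adding items shrinks the error term (the item-level variance divided by the number of items), so reliability rises with scale length, just as it does for a single-level Cronbach’s alpha.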


Analyzing Multiple Outcomes Simultaneously 


With a simple extension, the technique of adding a measurement level as level 1 (e.g., nesting
items within occasions and occasions within persons) can be used to analyze multiple outcomes
simultaneously. One of the important advantages provided by such an analysis is the abil- 
ity to compare the strength of the relationships different outcomes have with the same 
predictor. Assume a daily study in which three measures are collected each day—stress, 
depression, and anxiety—and anxiety and depression are measured with two items each. 
Note that to do this type of analysis, outcomes need to be measured with more than one 
item (more than one indicator in formal terms). 

Further assume that hypotheses of interest concern differences in mean levels of
anxiety and depression, and differences in the strength of the relationship between stress
and anxiety and the relationship between stress and depression. One way to test these
hypotheses would be to run separate models for anxiety and depression, and compare
(somehow) the resulting coefficients. The critical word in this sentence is somehow. It is
very difficult (and requires making assumptions that cannot be easily defended) to compare
the strength of coefficients generated by separate models.

In contrast, if a measurement level is added as level 1 to a two-level model (items 
nested within occasions, occasions nested within persons—structurally the same as the 
earlier reliability analyses), the comparisons mentioned above are straightforward. Let’s 
assume anxiety and depression are measured with two items each. The critical features 
of such an analysis are that at level 1, a dummy-coded variable is added, representing 
each outcome (Anx and Dep), and the intercept is dropped from the level-1 model. Such 
a level-1 model “brings up” to level 2 an estimate of the mean score for the items for each 
outcome. The model equations for such an analysis are below: 


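Item level                     $y_{ijk} = \pi_{1jk}(\mathrm{Anx}) + \pi_{2jk}(\mathrm{Dep}) + e_{ijk}$

Day level (Anxiety)            $\pi_{1jk} = \beta_{10k} + r_{1jk}$
Day level (Depression)         $\pi_{2jk} = \beta_{20k} + r_{2jk}$

Person level (Anxiety)         $\beta_{10k} = \gamma_{100} + u_{10k}$
Person level (Depression)      $\beta_{20k} = \gamma_{200} + u_{20k}$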

Similar to the reliability analysis described earlier, in this analysis there are i
observations nested within j days, which are nested within k persons. As presented in the
sample dataset below, for the measurement level (level 1) of the analysis, there are a total
of four observations for each day, one observation for each of the items of the two
measures. In addition to the data identifying each observation, for each level-1 observation
there are three measures: a dummy code indicating whether the item is an anxiety item
(Anx), a dummy code indicating whether the item is a depression item (Dep), and the
response itself (Resp).

When Anx and Dep are entered uncentered as predictors of Resp, the $\pi_{1jk}$ and $\pi_{2jk}$
coefficients become estimates of the means for the two anxiety and two depression
items, respectively. Mean levels of anxiety and depression can be compared by constraining
the $\gamma_{100}$ and $\gamma_{200}$ coefficients to be equal.


Person Day Anx Dep Resp 
A 1 1 0 4 
A 1 1 0 5 
A 1 0 1 2 
A 1 0 1 3 
A 2 1 0 5 
A 2 1 0 7 
A 2 0 1 3 
A 2 0 1 6 
B 1 1 0 4 
B 1 1 0 5 
B 1 0 1 2 
B 1 0 1 3 
B 2 1 0 5 
B 2 1 0 7 
B 2 0 1 3 
B 2 0 1 6 
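

Why dropping the intercept and entering the two dummy codes yields a mean for each outcome can be checked with ordinary least squares. The sketch below is a single-level illustration only (it ignores the random effects of the full model) and uses the responses from the sample dataset above. The function name `no_intercept_fit` is hypothetical.

```python
from collections import defaultdict

# Level-1 records from the sample dataset: (person, day, anx, dep, resp).
rows = [
    ("A", 1, 1, 0, 4), ("A", 1, 1, 0, 5), ("A", 1, 0, 1, 2), ("A", 1, 0, 1, 3),
    ("A", 2, 1, 0, 5), ("A", 2, 1, 0, 7), ("A", 2, 0, 1, 3), ("A", 2, 0, 1, 6),
    ("B", 1, 1, 0, 4), ("B", 1, 1, 0, 5), ("B", 1, 0, 1, 2), ("B", 1, 0, 1, 3),
    ("B", 2, 1, 0, 5), ("B", 2, 1, 0, 7), ("B", 2, 0, 1, 3), ("B", 2, 0, 1, 6),
]


def no_intercept_fit(records):
    """Least-squares fit of resp = b_anx * Anx + b_dep * Dep, with no intercept."""
    s_aa = sum(anx * anx for _, _, anx, dep, resp in records)
    s_dd = sum(dep * dep for _, _, anx, dep, resp in records)
    s_ad = sum(anx * dep for _, _, anx, dep, resp in records)
    s_ay = sum(anx * resp for _, _, anx, dep, resp in records)
    s_dy = sum(dep * resp for _, _, anx, dep, resp in records)
    # Solve the 2 x 2 normal equations directly.
    det = s_aa * s_dd - s_ad ** 2
    b_anx = (s_ay * s_dd - s_ad * s_dy) / det
    b_dep = (s_aa * s_dy - s_ad * s_ay) / det
    return b_anx, b_dep


# Fit separately within each person-day, mimicking what is "brought up" to the day level.
by_day = defaultdict(list)
for record in rows:
    by_day[(record[0], record[1])].append(record)
day_means = {key: no_intercept_fit(recs) for key, recs in by_day.items()}

print(day_means[("A", 1)])     # anxiety and depression item means for Person A, Day 1
print(no_intercept_fit(rows))  # pooled item means across all records
```

Because the two dummy codes never overlap, each coefficient is simply the mean of the items it flags: 4.5 and 2.5 for Person A on Day 1, and 5.25 and 3.5 across all records.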


This model can be expanded by adding predictors at different levels. For example,
daily stress could be added at the day level (which is now level 2). The model equations
for such an analysis are below:


Item level                     $y_{ijk} = \pi_{1jk}(\mathrm{Anx}) + \pi_{2jk}(\mathrm{Dep}) + e_{ijk}$

Day level (Anxiety)            $\pi_{1jk} = \beta_{10k} + \beta_{11k}(\mathrm{Stress}) + r_{1jk}$
Day level (Depression)         $\pi_{2jk} = \beta_{20k} + \beta_{21k}(\mathrm{Stress}) + r_{2jk}$

Person level (Anxiety)         $\beta_{10k} = \gamma_{100} + u_{10k}$
Person level (Anx-Slope)       $\beta_{11k} = \gamma_{110} + u_{11k}$
Person level (Depression)      $\beta_{20k} = \gamma_{200} + u_{20k}$
Person level (Dep-Slope)       $\beta_{21k} = \gamma_{210} + u_{21k}$


The strength of the relationship between stress and anxiety can be compared to the 
strength of the relationship between stress and depression by constraining the $\gamma_{110}$ and
$\gamma_{210}$ coefficients to be equal. Predictors can also be added to the person-level model to
model individual differences in day-level relationships, and using constraints similar to 
those used in the previous examples, the strength of these person-level relationships can 
be compared. 


Nonlinear Outcomes 


The previous discussion has concerned linear (continuous) outcomes, but it is not 
uncommon for diary-style research to concern nonlinear outcomes such as categorical 
responses. A simple example of this would be a dichotomous outcome: Did an individual 
experience a certain event during a day? More complicated examples include categori- 
cal outcomes with more than two categories, which might or might not be ordered (i.e., 
different categories might indicate more or less of an underlying variable), count data 
that are not normally distributed (how many times a day something occurred), and so 
forth. 

Analyzing such outcomes follows the same logic and relies upon models similar to 
the analyses of linear outcomes, but there are important differences. These differences 
are due to the fact that nonlinear outcomes violate one of the most important assump- 
tions underlying MLM analyses of linear outcomes—the independence of means and 
variances. For example, the variance of a binomial measure is Npq, where N is the number
of observations, p is the probability of an occurrence, and q is equal to 1 − p. This
lack of independence means that if nonlinear outcomes are treated as linear, the resulting
parameter estimates are likely to be inaccurate. Multilevel analyses of nonlinear outcomes
are the multilevel equivalent of single-level logistic regression, and such analyses
are sometimes referred to as multilevel logistic regression (MLR).
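
The dependence of the variance on the mean can be seen by evaluating Npq at several probabilities. The sketch below is only an illustration of that formula; the function name `binomial_variance` and the choice of N = 10 are mine.

```python
def binomial_variance(n, p):
    """Variance of a binomial count: N * p * q, with q = 1 - p."""
    q = 1.0 - p
    return n * p * q


# The mean of the count is N * p, so changing p changes both the mean and the variance.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(10 * p, 1), round(binomial_variance(10, p), 2))
```

The variance is largest at p = .5 and shrinks as p approaches 0 or 1, so it cannot be treated as a constant that is independent of the mean.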

Detailed discussion of the available options to conduct MLR is beyond the scope 
of this chapter, but a few details merit attention. Most fundamentally, to eliminate the 
dependence of the variance on the mean, in MLR analyses, the level-1 model includes 
what is essentially a transformation function for the outcome. The specific function
depends on the nature of the nonlinearity of the outcome. For example, for a
dichotomous outcome, a Bernoulli model with n = 1 can be used. The level-1 model for
such an analysis is below.


$\mathrm{Prob}(y = 1 \mid \beta_{0j}) = \phi$


In this model, a coefficient, $\phi$, representing the probability of y is then converted to
an expected log-odds ($\log[\phi/(1 - \phi)]$), and an expected log-odds is estimated for each
level-2 unit (person). How such analyses are described varies considerably across disci- 
plines, and analysts will need to take into account the norms for their disciplines as they 
prepare papers. 
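
The conversion between a probability and a log-odds, and back again, can be sketched as follows. The function names `log_odds` and `probability` are hypothetical, and the example probability is made up.

```python
import math


def log_odds(phi):
    """Convert a probability (strictly between 0 and 1) to a log-odds."""
    if not 0.0 < phi < 1.0:
        raise ValueError("phi must be strictly between 0 and 1")
    return math.log(phi / (1.0 - phi))


def probability(eta):
    """Convert a log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


phi = 0.20             # e.g., an event reported on 20% of days (made-up value)
eta = log_odds(phi)    # negative, because the event is less likely than not
print(round(eta, 3), round(probability(eta), 3))
```

A log-odds of 0 corresponds to a probability of .5; coefficients from such analyses are usually converted back to probabilities when results are described.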

A second issue that arises in some MLR analyses is the target of inference of the 
estimated coefficients. The critical issue is the extent to which an analyst is interested in 
estimating the population mean of a coefficient versus differences among level-2 units in 
coefficients. If estimating population averages is more of a focus (e.g., on what percentage
of days people report being depressed), then what are called “population average” estimates
are probably more appropriate. In contrast, if the focus is on individual differences
in how often (percentage of days) people report being depressed, “unit-specific” estimates 
might be more appropriate. Blanket recommendations about which of these two will be 
appropriate for a given circumstance are not possible, and analysts are encouraged to 
consult published research in their home disciplines for guidance about selecting a set of 
parameters upon which they will rely. 

Fortunately, building models (adding and centering predictors, making decisions 
about error structures, etc.) for MLR is very similar to building models for analyses 
of linear outcomes. Nevertheless, when interpreting the results of MLR analyses, it is 
important to keep in mind exactly what a coefficient represents, which can be compli- 
cated at times. For example, when the outcome is multinomial with k categories, there 
will be k - 1 functions representing the log-odds of being in one category relative to being 
in what is called the “reference category.” So, if participants can choose among yes, no, 
or maybe for a daily measure, then one function will represent the odds of responding 
“yes” relative to responding “maybe,” and a second will represent the odds of respond- 
ing “no” relative to responding “maybe.” Furthermore, level-2 differences in these coef- 
ficients need to be interpreted with this in mind. For analysts who are not familiar with 
logistic regression, interpreting the results of MLR can be particularly challenging and
time-consuming. Another feature of MLR is that there is no level-1 variance estimate. 
Given the nature of the distributions of the outcomes, there cannot be. 
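
For the yes/no/maybe example, the two functions can be illustrated by computing each log-odds relative to the reference category and then recovering the category probabilities from them. This is a sketch under assumptions: the proportions are made up, and the function names are hypothetical.

```python
import math


def log_odds_vs_reference(probs, reference):
    """Return the k - 1 log-odds of each category relative to the reference category."""
    ref_p = probs[reference]
    return {cat: math.log(p / ref_p) for cat, p in probs.items() if cat != reference}


def probabilities(log_odds, reference):
    """Recover all k category probabilities from the k - 1 log-odds."""
    denom = 1.0 + sum(math.exp(value) for value in log_odds.values())
    probs = {cat: math.exp(value) / denom for cat, value in log_odds.items()}
    probs[reference] = 1.0 / denom
    return probs


# Made-up response proportions for a daily measure with three categories.
daily = {"yes": 0.5, "no": 0.3, "maybe": 0.2}
functions = log_odds_vs_reference(daily, "maybe")
print(functions)                          # one function for "yes", one for "no"
print(probabilities(functions, "maybe"))  # the original proportions, up to rounding
```

A positive coefficient means that a category is more likely than the reference category, which is why results have to be read with the choice of reference category in mind.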


Software Options 


The increased interest in and use of multilevel analysis have been accompanied by an 
increase in the number of programs that can perform MLM, and such programs can be 
thought of in terms of two broad categories. There are “general-purpose” programs that 
can perform MLM and a wide variety of other analyses (e.g., Statistical Analysis Software 
[SAS], Statistical Package for the Social Sciences [SPSS]), and there are “single-purpose” 
programs designed to do only MLM (e.g., Hierarchical Linear Modeling [HLM]: Rauden- 
bush, Bryk, Cheong, & Congdon, 2004; MLwiN: Rasbash, Steele, Browne, & Goldstein, 
2009). There is broad agreement (a functional consensus) regarding the computational 
algorithms underlying MLM analyses, so different programs provide the same results 
when the same models are specified because they use the same algorithms. Note, how- 
ever, the phrase “when the same models are specified.” Analysts who are not experienced 
with MLM, or not experienced with how to conduct MLM using a specific program, may 
have trouble specifying the model they wish to test. 

For analysts who are not that familiar with MLM, I recommend using a single- 
purpose program such as HLM. Setting up models and interpreting outputs tend to be 
more straightforward when using single-purpose programs than when using general- 
purpose programs. Because single-purpose programs were designed to do only MLM 
analyses, the interface and output have been designed to set up MLM models and display 
the results of MLM analyses directly and efficiently. In contrast, for general-purpose 
programs, setting up MLM models and displaying the results of MLM analyses are just 
two of many possible analyses, so user interfaces have not been designed specifically for 
MLM. Of the single-purpose programs available, I have found HLM to be particularly
accessible in terms of setting up models and interpreting output. For example, in HLM,
there is no need to create different centered versions of predictors. The centering is done 
by the program when the model is specified. 

On the other hand, general-purpose programs offer at least two advantages over 
single-purpose programs. First, data preparation and analyses are all done with the same 
program. Single-purpose programs tend to have fewer options to transform data than do 
general-purpose programs (e.g., there are no data transformation options in HLM), and 
some analysts find it bothersome to transform data “outside” of a single-purpose program, 
then bring the data into that single-purpose program. Second, general-purpose programs 
may have more options for certain types of sophisticated analyses than do single-purpose 
programs. Such possibilities are particularly the case for SAS, in which analysts can 
combine PROC MIXED with other procedures and conduct advanced analyses, such as 
“Mixture Models,” in which similarities among error structures are then used as a basis 
for categorizing observations (in our case, respondents). Moreover, general-purpose pro- 
grams such as SAS tend to provide more options for modeling complex error structures, 
although it is important to note that MLwiN, a single-purpose program, can also model 
some fairly sophisticated error structures. 

When considering software options, I recommend that analysts make certain they 
fully understand all the components of the output of their programs (e.g., where exactly 
are the results of the tests of the fixed and random effects?). Although such a recom- 
mendation may seem foolish, I have reviewed papers and heard talks in which authors 
and speakers have focused on or discussed aspects of their outputs that were not truly 
relevant to the substantive questions at hand. Moreover, I suspect that such problems are 
more likely with general-purpose programs than with single-purpose programs because
of the differences in how results of analyses are organized in different types of programs. 
Finally, by their nature, multilevel analyses tend to be more complex than single-level 
analyses, and I recommend that analysts proceed cautiously when they consider adding 
sophisticated options. Less is more not only in terms of the number of predictors in a 
model, but also in terms of the sophistication of the model in the general sense. 


Guidelines for Publication 


Research using intensive repeated measures designs is conducted by scholars in a wide 
variety of disciplines. Norms about how to present the results of analyses probably vary 
widely, and the following guidelines need to be considered in light of such differences. As 
concisely as possible, readers should be given information that will allow them to under- 
stand the data structure, the models used, and the results. 


1. The structure of the data should be described unambiguously. What is nested 
within what? The numbers of observations at each level of analysis should be presented, 
and the distribution of observations at each level should be provided. If days are nested 
within persons, the mean number and standard deviation of the days (per person) included 
in the analyses should be presented. If observations are also nested within days, the mean 
number and standard deviation of the number of observations per day should be pre- 
sented. Criteria for including or excluding cases (persons, days, observations, etc.) should 
be described explicitly, and the number of excluded cases should be presented.


2. Multilevel summary statistics (means and variance estimates for each level of
analysis) should be provided for all measures. Such statistics provide a context for under- 
standing results. Similarly, univariate summary statistics should be presented for mea- 
sures that were not nested. 


3. When presenting results, in addition to the p-value, because $t = \gamma/SE$, it suffices
to present two of the following: $\gamma$ (or $\beta$), the t-value, and the standard error. I present $\gamma$
and t.


4. At present, I recommend presenting the equations representing the models to 
clarify what predictors were included at each level of analysis. Moreover, à la Bryk and
Raudenbush, I recommend presenting the equations for each level of analysis separately. 
Such separation emphasizes differences in phenomena across levels of analysis and high- 
lights cross-level effects, something that is particularly helpful for readers who are not 
familiar with multilevel modeling. 


5. How predictors were centered should be described explicitly (including uncen- 
tered). Results (coefficients) cannot be understood without knowing how predictors were 
centered. Do not assume that readers will know how predictors were centered. 


6. The error structure of the model should be described, including a description 
of the basis used to fix effects that were not modeled as random. Furthermore, unless 
hypotheses explicitly concern error structures per se, detailed discussions of error struc- 
tures are probably not necessary. 


7. I strongly recommend that authors present predicted values when describing 
results. This can be very helpful to readers when models are complex, such as when 
there are cross-level interactions. For continuous measures, the standard is to generate 
predicted values for observations ±1 SD. For categorical measures, predicted values can
be generated for observations that fall into different categories. 


8. Although it is a tradition in some disciplines, I do not necessarily see the value 
in presenting sequential tests of model fits. The emphasis in MLM (compared to SEM) 
is on the fixed effects rather than on the fit of the model as a whole. Moreover, most 
hypotheses concern fixed effects, and indices of model fits include both fixed and random 
components. When sequential comparisons provide insights, they are certainly valuable. 
When they do not, they tend to distract more than they inform. 
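
Two of the guidelines above involve simple arithmetic that can be sketched in a few lines: recovering the third quantity when a coefficient, its standard error, and its t-value are partly reported (Guideline 3), and generating predicted values 1 SD below and above the mean of a predictor (Guideline 7). The numbers are made up, the function names are hypothetical, and the predicted values assume a single predictor centered at its mean.

```python
def t_value(coefficient, standard_error):
    """t is the coefficient divided by its standard error."""
    return coefficient / standard_error


def standard_error(coefficient, t):
    """Recover the standard error when only the coefficient and t are reported."""
    return coefficient / t


def predicted_values(intercept, slope, sd):
    """Predicted outcome 1 SD below and 1 SD above the mean of a centered predictor."""
    return intercept - slope * sd, intercept + slope * sd


# Made-up estimates for illustration only.
t = t_value(0.50, 0.20)
se = standard_error(0.50, t)
low, high = predicted_values(intercept=4.0, slope=0.50, sd=1.2)
print(t, se, round(low, 2), round(high, 2))
```

For a cross-level interaction, the same calculation would be repeated at low and high values of the person-level moderator, which gives readers a table of predicted values rather than a set of coefficients to decode.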


Concluding Thoughts 


A chapter such as this cannot cover in detail all the topics that are relevant to using MLM 
to analyze diary-style data. For one, I did not discuss different options for modeling error 
structures. Moreover, the importance of modeling more sophisticated error structures 
than the standard MLM structure (modeling each random error term and all the covari- 
ances among them) will vary across disciplines. For example, autoregressive structures 
seem to be more important for growth curve analyses than for data that are simply col- 
lected across time (e.g., in a daily diary study). There are numerous sources for informa- 
tion about modeling more complex error structures. A good place to start is Hox (2002). 

I also did not discuss lagged analyses. Examining relationships between lagged
coefficients (e.g., a measure at time n and another at time n + 1) can provide some insights into
causal relationships between measures. I know of no formal treatment of lagged relationships
within the multilevel framework; however, interested readers can consult Nezlek
(2002) for an example of lagged analyses, with the understanding that the approach 
taken there is preliminary. Also of note is how I discussed mediation. The topic of media- 
tion per se is a “hot” topic, with all sorts of back and forth on the hows and whys, and 
mediation within the multilevel context is part of this dynamic area of research. The 
Bauer and colleagues (2006) reference represents current good, perhaps best, practice, 
although my sense is that the issue is not fully resolved. 

Regardless, this chapter should serve as a useful introduction to using MLM to ana- 
lyze the data collected in diary-style studies for those unfamiliar with such applications. 
Persons truly and totally unfamiliar with MLM may want to complement their reading 
by consulting some of the references I mentioned at the beginning of the chapter. For 
those more familiar with MLM, some of the techniques and issues I have discussed may 
cause them to reevaluate their present practice. Such reevaluation may not lead to changes 
in practice, but it may lead to more thoughtful application of these techniques.


CHAPTER 21


Structural Equation Modeling 
of Ambulatory Assessment Data 


MICHAEL EID 
DELPHINE S. COURVOISIER 
TANJA LISCHETZKE 


From a methodological point of view, the major aim of ambulatory assessment is the
measurement of variability over time. If behavior and feelings did not fluctuate over 
time, ambulatory assessment would be a waste of money because a single measure- 
ment would be sufficient for assessing an individual’s behavior or experience. However, 
not every fluctuation of scores over time indicates that the state of an individual really 
changes. Fluctuations in scores can just be due to measurement error. Because measure- 
ment errors cannot be avoided in the social and behavioral sciences, fluctuations of scores 
over time are at least partly due to measurement error. The crucial question then is to 
what degree behavior and feelings are really variable and how error-free states can be 
measured. Moreover, in order to detect the situational influences on behavior and experi- 
ences, the measurement of error-free, occasion-specific variables that can be related to 
situational characteristics is necessary. Structural equation modeling allows research- 
ers to separate measurement error from true individual scores and is able to distinguish 
between variability that is due to unsystematic measurement error, and variability that 
reflects systematic influences of situations and time. Moreover, structural equation mod- 
eling offers the possibility to calculate coefficients that indicate the psychometric prop- 
erties of measures, such as their reliability, stability, and variability. An introduction 
to structural equation modeling in general is given, for example, by Kline (2010) and
Schumacker and Lomax (2010). In this chapter we show how structural equation model- 
ing can be applied to ambulatory assessment data. We start with models for a single day. 
We first introduce general models of longitudinal analysis and show how these models 
can be adapted to the situation of individually varying times of observation. Then we 
extend this modeling approach to multiple days, focusing on models that are appropriate 
to analyze interindividual differences; models for single-case data are described by Brose 
and Ram (Chapter 25, this volume). Finally, we show how these models can be used to 
estimate scores of intraindividual variability.


Models for a Single Day


In ambulatory assessment, behavior and experiences are typically sampled several times 
within a single day. A single score on an occasion of measurement characterizes the cur- 
rent state of an individual. Because measurement errors cannot be avoided in ambulatory 
assessment, this observed score can be decomposed into two parts: 


Observed state score = true state score + measurement error (1) 


To separate measurement error from the true state score, it is necessary to have at least 
two indicators measuring the same behavior or experience. This has an important con- 
sequence for planning an ambulatory assessment study. Researchers should include at 
least two items measuring the same construct. If someone, for example, is interested in 
the momentary mood state of pleasantness, two items, such as pleasant and well, should 
be included. 

In structural equation modeling, the association between the true states over time 
can be modeled. Several models that differ in their complexity can be considered. In the 
following, we show step-by-step how a general structural equation model for ambulatory 
assessment data can be defined by taking the different sources of variability into account. 


The Latent Autoregressive Model 


The latent autoregressive model assumes that the true state score, $\mathrm{state}_{ti}$, of an
individual i on an occasion of measurement t can be decomposed into two parts: (1) one part
that is predicted by the true state score, $\mathrm{state}_{(t-1)i}$, on the directly preceding occasion
of measurement; and (2) one part, $\mathrm{up}_{ti}$, that represents occasion-specific influences that
are unpredictable and due to change (see Figure 21.1):


$\mathrm{state}_{ti} = \beta_{0t} + \beta_{t(t-1)} \cdot \mathrm{state}_{(t-1)i} + \mathrm{up}_{ti}$   for $t > 1$   (2)


where $\beta_{0t}$ is the intercept, and $\beta_{t(t-1)}$ is the autoregression parameter. If the autoregression
parameter is 0, there is no stability over time. Usually, the parameter is positive, showing
that there is a positive correlation between states over time. If the parameters $\beta_{0t}$ and $\beta_{t(t-1)}$
are time-independent, this indicates a time-homogeneous change process. This implies
that the prediction of the state scores by the state scores measured at the directly preceding
occasion does not change over time. If one wants to understand the causes of change
over time, it is important to explain the $\mathrm{up}_{ti}$ values. A value of 0 indicates that the state
on t equals exactly the state that is expected based on the state on t − 1. If the score $\mathrm{up}_{ti}$
is positive, this indicates that the state score is larger than expected given the preceding
state. If $\mathrm{up}_{ti}$ has a negative value, this shows that the value of the state score is lower
than expected. In an autoregressive model, interindividual differences in $\mathrm{up}_{ti}$ scores can
be explained by additional variables that have to be measured and integrated into the
model.
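
The recursion in Equation 2 can be traced by hand for one individual. The sketch below assumes the time-homogeneous case (the same intercept and autoregression parameter on every occasion) and uses made-up values; the function name `autoregressive_states` is hypothetical, and nothing here is estimated from data.

```python
def autoregressive_states(first_state, intercept, autoregression, up_values):
    """Apply state_t = intercept + autoregression * state_(t-1) + up_t for t > 1."""
    states = [first_state]
    for up in up_values:
        states.append(intercept + autoregression * states[-1] + up)
    return states


# Made-up true state on the first occasion and occasion-specific residuals for t = 2, 3, 4.
states = autoregressive_states(
    first_state=1.0,
    intercept=0.0,
    autoregression=0.5,
    up_values=[0.0, 0.5, -0.25],
)
print(states)
```

On the second occasion the residual is 0, so the state (0.5) is exactly what the preceding state predicts. On the third occasion the positive residual pushes the state above its expected value of 0.25, and on the fourth the negative residual pulls it below its expected value of 0.375.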

The latent states are linked to the observed variables in a measurement model. We
only refer to models with multiple indicators that require at least two indicators for each
construct on each occasion of measurement. Having multiple indicators is not a necessary
requirement. If there are at least four occasions of measurement, autoregressive models
can also be estimated with single indicators (Jöreskog, 1979). There are also single-indicator
extensions of the more complex models we propose later (e.g., the trait-state-error
model; Kenny & Zautra, 1995). However, these single-indicator models require
specific assumptions that are often very restrictive and not fulfilled in reality. Moreover,
having more indicators enables researchers to test more hypotheses about the psychometric
properties of the measures, as well as about variability and change. Therefore, we
strongly recommend planning an ambulatory assessment study with at least two indicators
for each construct.


FIGURE 21.1. Latent autoregressive model for six occasions of measurement. $\mathrm{STATE}_{t}$, latent
state variables; $\mathrm{UP}_{t}$, latent residual variables representing unpredictability; $E_{lt}$, error variables; t,
occasion of measurement; l, indicator; $\lambda_{lt}$, loading parameters; $\beta_{t(t-1)}$, autoregressive parameters.

In the measurement model, we assume that an observed score $y_{lti}$ of an individual i
on an occasion t with respect to an indicator l can be decomposed as follows:


$y_{lti} = \alpha_{lt} + \lambda_{lt} \cdot \mathrm{state}_{ti} + e_{lti}$   (3)


The parameters of this measurement model have the following meaning. The intercept ©, 
is specific to an occasion of measurement t and an indicator l. For identification reasons, 
it is usually assumed that the mean of the latent state variables is 0. Then, the intercepts 
Q; equal the expected value (mean) of indicator l on occasion t. If the intercept o, of 
an indicator l changes over time, the mean of this indicator changes over time. If the 
parameters Œ, did not differ between the occasions of measurement t, then there would 
be stability on the mean level. This would mean, for example, that the mean state repre- 
sented by an indicator (e.g., average mood state) does not depend on the time of day. This 
hypothesis could be statistically tested by fixing a, to be equal over time and comparing 
this model with a less restrictive model without this equality constraint. 

The parameter A, is the loading parameter. For identification reasons, one loading 
parameter has to be fixed to a specific value for each latent state variable. Usually, the 
loading parameter of the first indicator (/ = 1) is fixed to 1 for all occasions of measure- 
ment and the loading parameters of the other indicators are freely estimated. Differences 
in the loading parameters between the indicators measuring the same latent state variable 
indicate differences in discrimination. Indicators with loadings greater than 1 discrimi- 
nate more strongly, and indicators with loadings less than 1 discriminate less strongly 
than the indicator whose loading is fixed to 1. Differences in discrimination indicate that 
a specific difference between two latent state scores goes along with a larger (expected) 
difference on the observed variable for indicators with a higher discrimination parameter 
than for indicators with a lower discrimination parameter. The first-order autoregressive 
model is depicted in Figure 21.1 for six occasions of measurement. 
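
The interpretation of the loadings as discrimination parameters can be made explicit with a short worked equation. It is only an illustration under the usual assumption that the error variable has an expected value of 0 for every value of the latent state; the symbols $s_1$ and $s_2$ (two latent state scores) and the numerical values are introduced here and do not come from the study.

```latex
% Expected difference on the observed indicator for two latent state scores s_1 and s_2
E(Y_{tl} \mid \mathrm{STATE}_t = s_2) - E(Y_{tl} \mid \mathrm{STATE}_t = s_1)
  = \lambda_{tl} \cdot (s_2 - s_1)

% Hypothetical numbers: s_2 - s_1 = 2
%   reference indicator, \lambda_{t1} = 1   : expected difference = 2
%   second indicator,    \lambda_{t2} = 1.5 : expected difference = 3
```

The intercept cancels in the difference, which is why discrimination depends on the loading alone.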


MEASUREMENT INVARIANCE 


In longitudinal studies, a crucial question is whether we measure the same construct over 
time (Cheung & Rensvold, 2002; Meredith, 1993). Given the relatively short time lags 
in ambulatory assessment, it is unlikely that the construct changes over time. However, 
this assumption has to be tested. Different types of measurement invariance can be dis- 
tinguished (Millsap & Meredith, 2007; Widaman & Reise, 1997). In particular, three 
levels of measurement invariance are relevant for our purposes: weak, strong, and strict 
measurement invariance. Weak measurement invariance is given when the loadings $\lambda_{tl}$ of 
an indicator l do not differ over time. This could be tested by fixing the loading parameters 
$\lambda_{tl}$ to be equal over time. Strong measurement invariance requires, in addition to 
weak measurement invariance, that the intercepts $\alpha_{tl}$ be equal over time. To test strong 
measurement invariance, the loadings $\lambda_{tl}$ and the intercepts $\alpha_{tl}$ have to be fixed to be equal 
over time for each indicator l. We have mentioned that, for identification reasons, the 
mean values of the latent state variables are usually fixed to 0. In this case, the intercepts 
equal the mean values of the observed indicators. Fixing the intercepts to be equal over 
time would mean that there is no change in the mean values of the latent states over time. 
However, this would be too restrictive because measurement invariance does not neces- 
sarily imply mean stability over time. Instead, one is often interested in mean change but 
wants to ensure that the measurement structure does not change. To test strong measure- 
ment invariance and allow for mean change, one has to choose another way of identifying 
the mean structure. Instead of fixing the mean value of a latent variable, one can fix the 
intercept of one indicator to a specific value. Usually one fixes the intercept of the first 
indicator to 0. In this case, the mean of a latent state variable equals the mean of the first 
indicator. This shows how one can fix the intercepts to be equal over time (strong mea- 
surement invariance) without fixing the means of the latent states to be equal. Strict (or 
full) measurement invariance holds if in addition the variances of the error variables of 
an indicator l do not differ over time. 
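
The nesting of the three invariance levels can be sketched in a few lines of Python. This is only an illustration of the definitions: in practice, invariance is tested by comparing the fit of nested models with and without equality constraints, not by inspecting estimates. The function names and the tolerance argument are assumptions made for this sketch.

```python
from typing import List


def is_constant(values: List[float], tol: float = 1e-9) -> bool:
    """Return True if all values are equal within the tolerance."""
    return max(values) - min(values) <= tol


def invariance_level(loadings: List[float],
                     intercepts: List[float],
                     error_variances: List[float],
                     tol: float = 1e-9) -> str:
    """Classify the invariance level implied by one indicator's parameters.

    Each list holds one value per occasion of measurement for the same
    indicator. The levels are nested: strong requires weak, and strict
    requires strong.
    """
    if not is_constant(loadings, tol):
        return "none"
    if not is_constant(intercepts, tol):
        return "weak"      # equal loadings only
    if not is_constant(error_variances, tol):
        return "strong"    # equal loadings and intercepts
    return "strict"        # equal loadings, intercepts, and error variances


# Hypothetical parameter values for one indicator on three occasions
print(invariance_level([1.0, 1.0, 1.0], [0.2, 0.5, 0.1], [0.3, 0.3, 0.4]))
```

Because the checks are ordered, an indicator with unequal loadings is classified as non-invariant even if its intercepts happen to be equal.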


LIMITATIONS AND EXTENSIONS 


There are several limitations and extensions of the latent autoregressive model. First, the 
two indicators might not be perfectly homogeneous, as depicted in Figure 21.1, but they 
might contain an indicator-specific component that can be identified in longitudinal stud- 
ies. If the momentary mood is assessed by two indicators, and one indicator measures 
pleasantness and the other unpleasantness, the two indicators are not perfectly homogeneous 
but each contains a valence-specific component that is not shared with the other 
indicator. This heterogeneity can be considered by including an indicator-specific factor 
for all indicators (with exception of the first indicator) into the model (see Figure 21.2). As 
Eid, Schneider, and Schwenkmezger (1999) have shown, it is sufficient to consider (L - 1) 
indicator-specific factors, where L indicates the number of indicators. One indicator is 
chosen as the comparison standard indicator without an indicator-specific factor, which 
represents that part of an indicator that cannot be predicted by the standard indicator. 

FIGURE 21.2. Latent autoregressive model with indicator-specific factor for six occasions of 
measurement. $\mathrm{STATE}_t$, latent state variables; $\mathrm{UP}_t$, latent residual variables representing 
unpredictability; $E_{tl}$, error variables; $\mathrm{IND}_2$, indicator-specific factor for the second indicator; 
t, occasion of measurement; l, indicator; $\lambda_{tl}$, loading parameters for the latent state variables; 
$\lambda_{Itl}$, loading parameters for the indicator-specific factor; $\beta_{t(t-1)}$, autoregressive parameters. 

A value $\mathrm{ind}_{li}$ of an individual i on an indicator-specific factor shows whether this individual 
has a larger, a smaller, or the same value as would be expected based on his or her value 
on the standard indicator. One loading $\lambda_{Itl}$ on each indicator-specific factor has to be 
fixed to a number (e.g., 1) for identification reasons. The index I stands for indicator. To 
ensure measurement invariance, all loadings on an indicator-specific factor have to be 
fixed to be equal. Whether an indicator-specific factor is necessary or not can be tested 
by comparing the fit of the model in Figure 21.2 with the fit of the model in Figure 21.1. 
If the first indicator is chosen as comparison standard (as has been done in Figure 21.2), 
then the model is formally defined by: 


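$$y_{tli} = \alpha_{tl} + \lambda_{tl} \cdot \mathrm{state}_{ti} + e_{tli} \quad \text{for } l = 1 \qquad (4)$$
$$y_{tli} = \alpha_{tl} + \lambda_{tl} \cdot \mathrm{state}_{ti} + \lambda_{Itl} \cdot \mathrm{ind}_{li} + e_{tli} \quad \text{for } l \neq 1 \qquad (5)$$
$$\mathrm{state}_{ti} = \beta_{t0} + \beta_{t(t-1)} \cdot \mathrm{state}_{(t-1)i} + \mathrm{up}_{ti} \quad \text{for } t > 1 \qquad (6)$$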


The second limitation of the latent autoregressive model is the assumption that the 
autoregression parameter $\beta_{t(t-1)}$ has to be the same for all individuals. This assumption 
might be violated in ambulatory assessment by sampling procedures in which the occa- 
sion of measurement is randomly selected for an individual (e.g., in computerized ambula- 
tory assessment). In this case, the time lag between two occasions of measurement might 
differ between individuals. However, because the autoregression parameter depends on 
the time lag and should be larger for shorter time lags (indicating higher stability), the 
autoregression parameter should differ between individuals, and should be a function of 
the individual time lag. We show how such a model can be specified in the section 
“Models for Individually Varying Times of Observations.” 

The third limitation of the latent autoregressive model is that the autoregressive process 
might not correctly represent the process of change. The first-order autoregressive 
model assumes that for predicting a state on occasion t, it is sufficient to consider the state 
measured on occasion (t - 1). All measurements before (t - 1) do not have a direct effect 
on the state on occasion t. This implies a decrease of autocorrelations over time, and for 
a long enough time period the autocorrelations will become 0. Many studies, however, 
have shown that the lower asymptote of autocorrelations is not 0 but a larger value that 
is determined by a trait (Cole, Martin, & Steiger, 2005). Typically there is more stability 
in the process than assumed by the autoregressive model. Given multiple repeated mea- 
surements within a day, this higher stability could be explained by a day-specific level. 
Considering mood, for example, there could be good days and bad days, and each day 
could be represented by a day-specific mood level. The repeatedly measured mood states 
would then fluctuate around this day-specific mood level. A certain day-specific level 
could also be due to stable interindividual differences in the construct considered, such as 
the disposition to a specific mood (e.g., habitual mood level, “set-point”). To incorporate 
this idea, the latent autoregressive model has been extended to a latent autoregressive 
state-trait model. 
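
The difference between the two decay patterns can be illustrated numerically. The sketch below is not taken from the chapter; it assumes a stationary, time-homogeneous process with a single autoregression parameter, unit loadings, and a trait that is uncorrelated with the occasion-specific deviations. Under these assumptions the autocorrelation of a pure first-order autoregressive process at lag k is the k-th power of the parameter, whereas adding a trait gives a lower asymptote equal to the trait's share of the true variance. All function and argument names are hypothetical.

```python
def autocorr_ar1(beta: float, lag: int) -> float:
    """Autocorrelation of a stationary first-order autoregressive process."""
    return beta ** lag


def autocorr_state_trait(beta: float, lag: int,
                         var_trait: float, var_occ: float) -> float:
    """Autocorrelation of true states that fluctuate around a stable trait.

    The covariance between two true states that are `lag` occasions apart
    is Var(trait) + beta**lag * Var(occ); the total true variance is
    Var(trait) + Var(occ).
    """
    return (var_trait + beta ** lag * var_occ) / (var_trait + var_occ)


# Hypothetical values: beta = 0.5, equal trait and occasion-specific variance
for lag in (1, 2, 5, 20):
    print(lag,
          round(autocorr_ar1(0.5, lag), 4),
          round(autocorr_state_trait(0.5, lag, 1.0, 1.0), 4))
```

With these values the pure autoregressive autocorrelation approaches 0, while the state-trait autocorrelation levels off near .50.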


The Latent Autoregressive State-Trait Model 


In the latent autoregressive state-trait model, a common day-specific latent variable is 
added to the model. This common day-specific latent variable has an influence on all 
observed indicators (see Figure 21.3). To be in line with the terminology of latent state-trait 
theory, we call this common day-specific latent variable a latent trait variable. It is 
important to note that this latent variable measures not a general trait but the day-specific 
disposition to experience a specific feeling or to show a specific behavior. 

FIGURE 21.3. Latent autoregressive state-trait model for six occasions of measurement. TRAIT, 
latent trait variable; $\mathrm{OCC}_t$, latent occasion-specific variables; $\mathrm{UP}_t$, latent residual variables 
representing unpredictability; $E_{tl}$, error variables; t, occasion of measurement; l, indicator; 
$\lambda_{Ttl}$, loading parameters for the latent trait variable; $\lambda_{Otl}$, loading parameters for the 
occasion-specific variables; $\beta_{t(t-1)}$, autoregressive parameters. 

The autoregressive state-trait model is defined by the following equations: 


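$$y_{tli} = \alpha_{tl} + \lambda_{Ttl} \cdot \mathrm{trait}_{i} + \lambda_{Otl} \cdot \mathrm{occ}_{ti} + e_{tli} \qquad (7)$$
and 
$$\mathrm{occ}_{ti} = \beta_{t(t-1)} \cdot \mathrm{occ}_{(t-1)i} + \mathrm{up}_{ti} \quad \text{for } t > 1 \qquad (8)$$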


The loadings $\lambda_{Ttl}$ on the trait represent the influence of the trait on the observed variables. 
Again, the loading of one observed variable has to be fixed for identification reasons. The 
trait represents the stable component over time. Including the latent trait variable has an 
influence on the meaning of the latent variables in the autoregressive part of the model. 
The latent variables that have been the latent state variables in the autoregressive model 
are now latent occasion-specific variables. A value $\mathrm{occ}_{ti}$ of an individual i on occasion t 
on an occasion-specific variable indicates the momentary deviation of the true state of 
this individual from the value expected given the latent trait score. The occasion-specific 
variables represent the deviations of the momentary states from the day-specific level. 
According to latent state-trait theory (Eid, 1996; Steyer, Schmitt, & Eid, 1999), this 
deviation is due to situational influences and the interaction between the person and 
the situation. The latent trait variable is uncorrelated with the latent occasion-specific 
variables and the residuals on the level of the occasion-specific variables. Because the 
occasion-specific factors are residual factors, their means have to be 0. If the mean of 
the trait factor is fixed to 0, the intercepts equal the expected values of the observed 
variables. If the intercept of the first indicator is fixed to 0 and its loading to 1, the mean 
of the trait can be estimated. It then equals the mean of the first indicator. Because the 
occasion-specific variables are residuals with a mean value of 0, there is no constant in 
the autoregressive part of the model. 
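
The data-generating process described by Equations 7 and 8 can be sketched as a small simulation for one individual. This is an illustration under assumptions that go beyond the text: normally distributed latent and error variables, parameters that are invariant over time, and a single autoregression parameter. The function name, argument names, and all numerical values are hypothetical.

```python
import random
from typing import List, Sequence


def simulate_person(trait: float, n_occasions: int,
                    alpha: Sequence[float], lam_t: Sequence[float],
                    lam_o: Sequence[float], beta: float,
                    sd_occ1: float, sd_up: float, sd_err: float,
                    rng: random.Random) -> List[List[float]]:
    """Generate observed scores y[t][l] for one individual.

    alpha, lam_t, and lam_o hold one value per indicator (assumed equal
    over time). The occasion-specific variable has mean 0, so the
    autoregressive part has no constant.
    """
    occ = rng.gauss(0.0, sd_occ1)  # occasion-specific deviation on occasion 1
    scores = []
    for t in range(n_occasions):
        if t > 0:
            # Equation 8: first-order autoregression of the deviations
            occ = beta * occ + rng.gauss(0.0, sd_up)
        # Equation 7: intercept + trait part + occasion-specific part + error
        row = [alpha[l] + lam_t[l] * trait + lam_o[l] * occ
               + rng.gauss(0.0, sd_err)
               for l in range(len(alpha))]
        scores.append(row)
    return scores


rng = random.Random(1)
y = simulate_person(trait=0.5, n_occasions=6, alpha=[3.0, 3.2],
                    lam_t=[1.0, 0.9], lam_o=[1.0, 0.9], beta=0.4,
                    sd_occ1=0.6, sd_up=0.5, sd_err=0.4, rng=rng)
```

Setting all standard deviations to 0 reduces every score to the intercept plus the weighted trait value, which shows that all occasion-to-occasion variation in this model comes from the occasion-specific variables and measurement error.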

If the observed indicators are not perfectly homogeneous, the model has to be 
extended. In general, there are two ways to consider heterogeneity on the level of the 
observed indicators. In the first way, the model is extended by including indicator-specific 
residual factors, as in the autoregressive model. Such a model is depicted in Figure 21.4. 
Just as in the autoregressive model with indicator-specific variables, there are only (L - 1) 
indicator-specific factors, and one indicator has to be chosen as the comparison standard 
(without indicator-specific factor). If one chooses the first indicator as the comparison 
standard, the model is defined by the following equations: 


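$$y_{tli} = \alpha_{tl} + \lambda_{Ttl} \cdot \mathrm{trait}_{i} + \lambda_{Otl} \cdot \mathrm{occ}_{ti} + e_{tli} \quad \text{for } l = 1 \qquad (9)$$
$$y_{tli} = \alpha_{tl} + \lambda_{Ttl} \cdot \mathrm{trait}_{i} + \lambda_{Otl} \cdot \mathrm{occ}_{ti} + \lambda_{Itl} \cdot \mathrm{ind}_{li} + e_{tli} \quad \text{for } l \neq 1 \qquad (10)$$
$$\mathrm{occ}_{ti} = \beta_{t(t-1)} \cdot \mathrm{occ}_{(t-1)i} + \mathrm{up}_{ti} \quad \text{for } t > 1 \qquad (11)$$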


The other possibility would be to define a latent trait factor separately for each indicator. 
Figure 21.5 is a model with indicator-specific trait variables. Each indicator has its own 
latent trait variable, and the latent trait variables are correlated. Given that the observed 
variables are intended to measure the same construct, the correlations between the latent 
trait variables should be high. The models in Figures 21.4 and 21.5 are data-equivalent if 
one puts special constraints on the parameters of the model (see Eid et al., 1999). 

FIGURE 21.4. Latent autoregressive state-trait model with indicator-specific factor for six 
occasions of measurement. TRAIT, latent trait variable; $\mathrm{OCC}_t$, latent occasion-specific variables; 
$\mathrm{UP}_t$, latent residual variables representing unpredictability; $E_{tl}$, error variables; $\mathrm{IND}_2$, 
indicator-specific factor for the second indicator; t, occasion of measurement; l, indicator; 
$\lambda_{Ttl}$, loading parameters for the latent trait variable; $\lambda_{Otl}$, loading parameters for the 
occasion-specific variables; $\lambda_{Itl}$, loading parameters for the indicator-specific factor; 
$\beta_{t(t-1)}$, autoregressive parameters. 

FIGURE 21.5. Latent autoregressive state-trait model with indicator-specific trait variables for 
six occasions of measurement. $\mathrm{TRAIT}_l$, latent trait variables; $\mathrm{OCC}_t$, latent occasion-specific 
variables; $\mathrm{UP}_t$, latent residual variables representing unpredictability; $E_{tl}$, error variables; 
t, occasion of measurement; l, indicator; $\lambda_{Ttl}$, loading parameters for the latent trait variables; 
$\lambda_{Otl}$, loading parameters for the occasion-specific variables; $\beta_{t(t-1)}$, autoregressive parameters. 

The advantage of the model in Figure 21.4 is that the indicator-specific residual 
factors measure that part of an indicator that is not shared with the standard indicator. 
If one extends this model by other latent variables, one can systematically analyze the 
relationship between these other latent variables and the common, as well as the specific, 
trait variables. On the other hand, the model in Figure 21.5 has the advantage that it can 
be easily extended to more general models, such as models with growth structures, which 
we will introduce in the next section. 


MEASUREMENT INVARIANCE 


Weak measurement invariance over time is fulfilled if all trait loading parameters and all 
occasion-specific loading parameters belonging to the same indicator are equal over time, 
but the means of the observed variables can vary across occasions of measurement. 

Strong measurement invariance would be given if, in addition, the intercepts belong- 
ing to the same indicator were equal over time. Strong measurement invariance implies 
the stability of the mean scores of the observed variables over time because the means of 
the latent trait variables do not change over time. If strong measurement invariance is ful- 
filled, the latent autoregressive state-trait model is a model of variability: States fluctuate 
around a stable trait, and there are no mean changes over time. Changes are due only to 
occasion-specific influences. Strict measurement invariance, in addition, requires that the 
error variances belonging to the same indicator do not differ over time. 

The model with weak measurement invariance allows the mean scores to change 
over time, but the direction of change is not specified. Sometimes researchers would like 
to know whether there is a trend in the data. For example, there could be a linear trend 
showing that people become happier over the course of the day. There might also be a 
nonlinear trend indicating that people become more awake over the course of the day and 
sleepier in the late evening. If a researcher is interested in these trends, the autoregressive 
state-trait model has to be extended by growth factors representing these trends. 


Latent Autoregressive State-Trait Growth Curve Model 


In Figure 21.6, latent growth factors are added to the autoregressive latent state-trait 
model with indicator-specific trait variables. The factor loadings of a growth factor are 
fixed to specific numbers. The loadings are 0 for the first occasion of measurement, 1 for 
the second occasion of measurement, 2 for the third occasion of measurement, and in 
general (t - 1) for the tth occasion of measurement. This loading structure requires equal 
distance between the assessments. Values of a growth factor are the individual growth 
parameters. These are regression parameters that differ between individuals and repre- 
sent the degree of linear growth (or decline) over time. With respect to a daily assessment 
study, this would mean that the behavior or experience (e.g., pleasant mood) increases or 
decreases over the time course of the day in a linear manner. In addition, the loadings on 
the indicator-specific trait factors are fixed to 1. This means that the trait factors are the 
intercept factors. A value on the trait factor is the individual intercept of a person. The 
variances of the trait factors represent interindividual differences in the intercepts, showing 
that the individuals differ in their values on the first occasion of measurement. 

FIGURE 21.6. Latent autoregressive state-trait growth curve model with indicator-specific trait 
and growth variables for six occasions of measurement. $\mathrm{TRAIT}_l$, latent trait variables; 
$\mathrm{GROWTH}_l$, growth factors; $\mathrm{OCC}_t$, latent occasion-specific variables; $\mathrm{UP}_t$, latent residual 
variables representing unpredictability; $E_{tl}$, error variables; t, occasion of measurement; 
l, indicator; $\lambda_{Otl}$, loading parameters for the occasion-specific variables; $\beta_{t(t-1)}$, 
autoregressive parameters. 

This model is defined in the following way: 


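$$y_{tli} = \mathrm{trait}_{li} + (t - 1) \cdot \mathrm{growth}_{li} + \lambda_{Otl} \cdot \mathrm{occ}_{ti} + e_{tli} \qquad (12)$$


$$\mathrm{occ}_{ti} = \beta_{t(t-1)} \cdot \mathrm{occ}_{(t-1)i} + \mathrm{up}_{ti} \quad \text{for } t > 1 \qquad (13)$$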


It is important to note that there is no general intercept $\alpha_{tl}$ in this model. The general 
intercept is the mean value of the trait factor. In the latent autoregressive state-trait 
growth curve model there is a linear trend in the change process as well as variability. 
Figure 21.7 illustrates the types of variability and change inherent to this model for three 
individuals. The true state scores fluctuate around a linear trend line. The linear trend 
line of an individual is determined by the individual values of the trait and the growth 
factor. The deviation of the true state from the line is measured by the value on the 
occasion-specific variable. The autoregressive process is represented by the inertia of the 
fluctuations around the line but is not explicitly visible in Figure 21.7. 
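
The decomposition of a true state into a trend line and a deviation from it, as in Equation 12 without the error term, can be written out in a few lines. The sketch assumes equidistant occasions numbered from 1; the function names and the numerical values are hypothetical and are not estimates from the study.

```python
def trend_value(trait: float, growth: float, t: int) -> float:
    """Value of an individual's linear trend line on occasion t (t = 1, 2, ...).

    The growth loading is (t - 1), so the trait value is the intercept,
    that is, the trend value on the first occasion.
    """
    return trait + (t - 1) * growth


def true_state(trait: float, growth: float, t: int,
               lam_o: float, occ: float) -> float:
    """True state: trend line plus weighted occasion-specific deviation."""
    return trend_value(trait, growth, t) + lam_o * occ


# Hypothetical individual: intercept 3.0, increase of 0.5 per occasion
print([trend_value(3.0, 0.5, t) for t in range(1, 7)])
print(true_state(3.0, 0.5, 3, 1.0, -0.25))
```

An occasion-specific value of 0 places the true state exactly on the trend line; positive and negative values place it above and below the line.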


LIMITATIONS AND EXTENSIONS 


The model in Figure 21.6 only makes sense if (1) there is linear growth or decline for all 
individuals, (2) the time lag between two occasions of measurement is the same for all 
individuals, and (3) the time lag between two neighboring occasions of measurement 
does not change over time. This requires the occasions of measurement to be equidistant. 
These assumptions, however, can be violated. If the change process is curvilinear, the 
model can be extended by factors indicating quadratic ($\mathrm{growth}^2$) change. If the change 
process is best described by a cubic model, a further factor ($\mathrm{growth}^3$) has to be added, 
and so forth. However, the change process might not follow a specific function of time, 
in which case the growth factor(s) can be omitted. If the occasions of measurement are 
not equidistant and differ between individuals, the model has to be extended to a model 
for individually varying times of observations (see the section “Models for Individually 
Varying Times of Observations”). 

FIGURE 21.7. Latent autoregressive state-trait growth curve model with indicator-specific trait 
variables: Variability and change pattern for three individuals. 


DECOMPOSITION OF VARIANCE 


The model in Figure 21.6 is a general model integrating different types of change (linear 
change, variability). In this model, the variance $\mathrm{Var}(Y_{tl})$ of an observed variable $Y_{tl}$ on 
an occasion of measurement t can be decomposed into several components (in contrast 
to single values, we denote variables with capital letters and without the index i for 
individuals): 


1. The variance due to the growth curve part (trait factor and growth factor). 
2. The variance due to the occasion-specific variables $\mathrm{OCC}_t$. 
3. Measurement error. 


Based on these components, several coefficients can be defined: 


• The unreliability coefficient 

$$\mathrm{URel}(Y_{tl}) = \frac{\mathrm{Var}(E_{tl})}{\mathrm{Var}(Y_{tl})} \qquad (14)$$

is the part of the variance of an observed variable that is due to measurement error. 

• The complement 

$$\mathrm{Rel}(Y_{tl}) = 1 - \mathrm{URel}(Y_{tl}) = 1 - \frac{\mathrm{Var}(E_{tl})}{\mathrm{Var}(Y_{tl})} \qquad (15)$$

is the reliability coefficient that indicates to which degree observed interindividual 
differences are due to true (error-free) interindividual differences. 

• The occasion-specificity coefficient 

$$\mathrm{OSpe}(Y_{tl}) = \frac{\lambda_{Otl}^{2} \cdot \mathrm{Var}(\mathrm{OCC}_{t})}{\mathrm{Var}(Y_{tl}) - \mathrm{Var}(E_{tl})} \qquad (16)$$

is the degree of true variance that is due to occasion-specific influences. 

• The complement 

$$\mathrm{Con}(Y_{tl}) = 1 - \mathrm{OSpe}(Y_{tl}) = 1 - \frac{\lambda_{Otl}^{2} \cdot \mathrm{Var}(\mathrm{OCC}_{t})}{\mathrm{Var}(Y_{tl}) - \mathrm{Var}(E_{tl})} \qquad (17)$$

is the consistency coefficient indicating the part of the true variance that is due to true 
individual differences, which can be predicted by knowing the individual growth curves. 
If there is no growth factor, the consistency coefficient represents the amount of true 
stability, that is, the degree to which true individual differences on an occasion of 
measurement can be predicted by stable trait differences. 

• The unpredictability coefficient 

$$\mathrm{Upred}(Y_{tl}) = \frac{\lambda_{Otl}^{2} \cdot \mathrm{Var}(\mathrm{UP}_{t})}{\mathrm{Var}(Y_{tl}) - \mathrm{Var}(E_{tl})} \qquad (18)$$

is that proportion of the true variance that can be predicted by neither the growth process 
nor the autoregressive process. It represents the degree to which interindividual differences 
are due to true unpredictable fluctuations that can be attributed to occasion-specific 
influences. The unpredictability coefficient represents the amount of “pure” true 
variability that does not follow any systematic change process. 

• The counterpart to the unpredictability coefficient is the predictability coefficient 

$$\mathrm{Pred}(Y_{tl}) = 1 - \mathrm{Upred}(Y_{tl}) = 1 - \frac{\lambda_{Otl}^{2} \cdot \mathrm{Var}(\mathrm{UP}_{t})}{\mathrm{Var}(Y_{tl}) - \mathrm{Var}(E_{tl})} \qquad (19)$$

It indicates the degree to which true interindividual differences on an occasion of 
measurement are predictable by either stable differences or change processes. 
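
The six coefficients can be computed directly from model-implied variances. The following sketch implements Equations 14 to 19 as stated; the function name, argument names, and the numerical values in the usage example are hypothetical and are not estimates from the study.

```python
from typing import Dict


def variance_coefficients(var_y: float, var_e: float, lam_o: float,
                          var_occ: float, var_up: float) -> Dict[str, float]:
    """Variance components for one observed variable on one occasion.

    var_y   : model-implied variance of the observed variable
    var_e   : error variance
    lam_o   : loading on the occasion-specific variable
    var_occ : variance of the occasion-specific variable
    var_up  : variance of the residual (unpredictability) variable
    """
    true_var = var_y - var_e                      # error-free variance
    urel = var_e / var_y                          # Equation 14
    ospe = lam_o ** 2 * var_occ / true_var        # Equation 16
    upred = lam_o ** 2 * var_up / true_var        # Equation 18
    return {
        "URel": urel,
        "Rel": 1.0 - urel,                        # Equation 15
        "OSpe": ospe,
        "Con": 1.0 - ospe,                        # Equation 17
        "Upred": upred,
        "Pred": 1.0 - upred,                      # Equation 19
    }


# Hypothetical values chosen so that the results are easy to check by hand
coef = variance_coefficients(var_y=1.0, var_e=0.3, lam_o=1.0,
                             var_occ=0.35, var_up=0.28)
print({name: round(value, 3) for name, value in coef.items()})
```

Note that reliability and unreliability are proportions of the observed variance, whereas the other four coefficients are proportions of the true (error-free) variance.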


EMPIRICAL EXAMPLE 


To illustrate the models presented so far, we refer to an ambulatory assessment study 
on the measurement of momentary mood in daily life (Courvoisier, Eid, Lischetzke, & 
Schreiber, 2010). The sample consisted of 305 participants (231 females; 74 males; age: 
mean = 22.79, SD = 5.66). Participants were called six times per day for 7 days on their 
mobile phones by a computer. Participants were called in intervals of 2 hours. If a par- 
ticipant could not answer the first call of each session, he or she was re-called every 30 
minutes during the 2-hour interval until the call was accepted or the session was ended. 
Each participant received CHF 75 (Swiss francs) as compensation. The computer soft- 
ware used to manage the calls was SmartQ (Telesage, 2005). The items were presented 
by a natural prerecorded voice. To give their answers, participants had to press one of five 
buttons (1, 2, 3, 4, 5) with 1 meaning Not at all and 5 meaning Very much. Momentary 
mood was assessed by a short version of a pleasantness—unpleasantness adjective check- 
list (Steyer, Schwenkmezger, Notz, & Eid, 1997) consisting of four adjectives measuring 
the pleasant-unpleasant dimension of momentary mood. Items were glücklich (happy), 
unglücklich (unhappy), wohl (well), and unzufrieden (dissatisfied). The negative items 
were recoded. The items were grouped into two test halves representing the mean of the 
responses to the two items. Higher values indicate a more positive mood. 

The models presented so far were tested for a single day (Thursday). In all mod- 
els, strict measurement invariance was assumed and, in addition, the intercepts and the 
residual variances in the autoregressive part of the model were set equal over time. The 
analyses were done with the computer program Mplus (Muthén & Muthén, 1998-2007). 
The robust maximum likelihood (MLR) estimator for missing data was used. The fit 
of the different models is summarized in Table 21.1. The autoregressive model without 
indicator-specific factors does not fit the data. All other models fit the data according 
to the goodness-of-fit coefficients. The fit of the different models can be compared by 
information criteria, such as the Akaike information criterion (AIC) and the Bayesian 
information criterion (BIC). According to the AIC and the BIC, the best fitting model 
is the latent autoregressive state-trait model with two latent trait variables. The latent 
autoregressive state-trait growth curve model with indicator-specific trait variables does 
not fit the data better than the model without a growth factor. Moreover, the estimated 
variances of the growth factors have negative values close to 0. This shows that there is 
no linear growth in the data. 
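
The comparison by information criteria amounts to selecting the model with the smallest value. The sketch below uses the AIC and BIC values reported in Table 21.1; the dictionary layout and the helper function are assumptions made for this illustration.

```python
# (AIC, BIC) values as reported in Table 21.1
fit = {
    "Latent autoregressive model": (5974.36, 5996.36),
    "Latent autoregressive model with indicator-specific factor":
        (5926.48, 5955.82),
    "Latent autoregressive state-trait model with one trait":
        (5921.93, 5951.26),
    "Latent autoregressive state-trait model with indicator-specific "
    "trait variables": (5903.98, 5936.98),
    "Latent autoregressive state-trait growth curve model with "
    "indicator-specific trait variables": (5912.66, 5978.65),
}


def best_model(criterion_index: int) -> str:
    """Return the name of the model with the smallest criterion value."""
    return min(fit, key=lambda name: fit[name][criterion_index])


best_aic = best_model(0)  # index 0: AIC
best_bic = best_model(1)  # index 1: BIC
print(best_aic)
print(best_bic)
```

Both criteria select the same model here, which need not be the case in general because the BIC penalizes additional parameters more strongly than the AIC.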

The parameter estimates of the latent autoregressive state-trait model with indicator- 
specific traits are depicted in Figure 21.8. These parameter estimates can be used to 
calculate the coefficients of reliability, unreliability, occasion specificity, consistency, pre- 
dictability, and unpredictability. 

To calculate the unreliability coefficients, the error variances and the variances of 
the observed variables are needed. The error variances are estimated by the program. 
The variances of the observed variables are not the variances in the sample variance- 
covariance matrix, but the variances of the observed variables implied by the model. 
These variances can be requested in most programs for structural equation modeling. 
If they are not reported, they can be calculated for the autoregressive latent state-trait 
model on the basis of the estimated parameters by the equation 


$$\mathrm{Var}(Y_{tl}) = \lambda_{Ttl}^{2} \cdot \mathrm{Var}(T_{l}) + \lambda_{Otl}^{2} \cdot \mathrm{Var}(\mathrm{OCC}_{t}) + \mathrm{Var}(E_{tl}) \qquad (20)$$


TABLE 21.1. Goodness-of-Fit Criteria for the Different Models with Strict Measurement 
Invariance 

Model                                        χ²      df   p      CFI    RMSEA    AIC       BIC 
Latent autoregressive model                  139.15  84   <.01   0.94   0.05     5974.36   5996.36 
Latent autoregressive model with              97.69  82   .11    0.98   0.03     5926.48   5955.82 
  indicator-specific factor 
Latent autoregressive state-trait model       95.24  82   .15    0.98   0.02     5921.93   5951.26 
  with one trait 
Latent autoregressive state-trait model       78.43  81   .56    1.00   <0.01    5903.98   5936.98 
  with indicator-specific trait variables 
Latent autoregressive state-trait growth      70.39  72   .53    1.00   <0.01    5912.66   5978.65 
  curve model with indicator-specific 
  trait variables 

Note. CFI, comparative fit index; RMSEA, root mean square error of approximation; AIC, Akaike information criterion; 
BIC, Bayesian information criterion. 


The easiest way to calculate the unreliability coefficient is to use the reliability coefficients 
reported by many programs for structural equation modeling in the standardized solu- 
tion. In our empirical example, the estimated reliability coefficients are between .657 and 
.703, and the estimated unreliability coefficients are therefore between .297 and .343. 
Given that each observed variable consists of only two items, the reliability coefficients 
are rather high. 

The occasion specificity coefficients are calculated using Equation 16 and the vari- 
ances of the observed variables implied by the model. The estimated occasion-specificity 
coefficients are between .471 and .569, showing that about half of the true (error-free) 
interindividual differences found on an occasion of measurement are due to occasion- 
specific influences. Consequently, the consistency coefficients are between .431 and .529, 
indicating that the other half of interindividual differences are due to stable individual 
differences on this specific day. Although the items assessed the momentary mood, the 
large amount of stability in the data shows that it makes sense to characterize the mood of 
a person by his or her average day-specific level (see Fleeson & Noftle, Chapter 29, this 
volume). If one also takes the autoregressive structure into account, the unpredictability 
coefficients are even smaller than the occasion-specificity coefficients, and the predict- 
ability coefficients are higher than the consistency coefficients. The estimated values of 
the unpredictability coefficients are between .387 and .421, and the predictability 
coefficients are between .579 and .613. These values show that a mood state can be predicted 
by the day-specific mood level and the preceding mood state. According to these results, 
mood states are far from being unpredictable. The prediction of the current mood of a 
person could even be enhanced by including occasion-specific covariates in the model 
that characterize the situation on an occasion of measurement. 

FIGURE 21.8. Estimated parameters of the latent autoregressive state-trait model with 
indicator-specific trait variables for six occasions of measurement. Reported are the 
unstandardized parameter estimates and the standardized parameter estimates (in parentheses) 
according to the completely standardized solution. $\mathrm{TRAIT}_l$, latent trait variables; $\mathrm{OCC}_t$, 
latent occasion-specific variables; $\mathrm{UP}_t$, latent residual variables representing 
unpredictability; $E_{tl}$, error variables; t, occasion of measurement; l, indicator. 


Models for Individually Varying Times of Observations 


The models presented so far assume that the autoregression parameters are the same for 
all individuals. This makes sense only if the time lag between two occasions of measure- 
ment is the same for all individuals. If the time lag differs between individuals, models 
with autoregressive structures have to take these individual differences into account. This 
could be done by replacing the autoregression parameter $\beta_{t(t-1)}$ by $\beta^{\mathrm{lag}_{ti}}$. The exponent $\mathrm{lag}_{ti}$ 
is the individual time lag between t and t - 1 for individual i. An autoregressive latent 
state-trait model with individually varying times of observations (and indicator-specific 
traits) is defined by the equations: 


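$$y_{tli} = \alpha_{tl} + \lambda_{Ttl} \cdot \mathrm{trait}_{li} + \lambda_{Otl} \cdot \mathrm{occ}_{ti} + e_{tli} \qquad (21)$$
and 
$$\mathrm{occ}_{ti} = \beta^{\mathrm{lag}_{ti}} \cdot \mathrm{occ}_{(t-1)i} + \mathrm{up}_{ti} \quad \text{for } t > 1 \qquad (22)$$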


The reason for defining the autoregressive structure in such a way becomes clear if one 
considers the general properties of the autoregressive model. Consider that there are three 
occasions of measurement, with a time lag of 1 hour between two occasions of measure- 
ment. The autoregressive part of the state-trait model would then have the following 
structure: 


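$$\mathrm{occ}_{3i} = \beta_{32} \cdot \mathrm{occ}_{2i} + \mathrm{up}_{3i} \qquad (23)$$
$$\mathrm{occ}_{2i} = \beta_{21} \cdot \mathrm{occ}_{1i} + \mathrm{up}_{2i} \qquad (24)$$

The autoregression parameter $\beta_{t(t-1)}$ indicates the effect an occasion-specific variable has 
on another occasion-specific variable measured 1 hour later. What would be the effect 
of an occasion-specific variable on another occasion-specific variable measured 2 hours 
later? Inserting Equation 24 in Equation 23 results in 


$$\mathrm{occ}_{3i} = \beta_{32} \cdot (\beta_{21} \cdot \mathrm{occ}_{1i} + \mathrm{up}_{2i}) + \mathrm{up}_{3i} = \beta_{32} \cdot \beta_{21} \cdot \mathrm{occ}_{1i} + \beta_{32} \cdot \mathrm{up}_{2i} + \mathrm{up}_{3i} \qquad (25)$$

Under the assumption that the autoregressive process is time-homogeneous (i.e., the 
autoregression parameter is constant across time) one obtains the following equation for 
a time lag of 2 hours: 


$$\mathrm{occ}_{3i} = \beta^{2} \cdot \mathrm{occ}_{1i} + \beta \cdot \mathrm{up}_{2i} + \mathrm{up}_{3i} \qquad (26)$$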


Here, $\beta$ indicates the autoregression parameter for 1 hour. This shows that if an 
autoregression parameter represents a specific time lag of one unit, the autoregression 
parameters for other time lags that are a multiple of a unit can be obtained by putting the multiple 
in the exponent of the autoregression parameter. Therefore, defining the autoregression 
parameter by $\beta^{\mathrm{lag}_{ti}}$ in the model for individually varying times of observations makes sure 
that the individual time lags are appropriately considered. The meaning of an autoregression 
parameter is then the effect of an occasion-specific variable on another one measured 
with a time lag of one unit of the variable $\mathrm{LAG}_t$, measuring the time lag between 
two occasions of measurement. If the time lag is measured in minutes, the parameter $\beta$ 
indicates the autoregressive effect for 1 minute. If the time lag is measured in hours, the 
parameter $\beta$ indicates the autoregressive effect for 1 hour, and so forth. It is important to 
note that the model assumes that the autoregressive effect (for one time unit) is the same 
for all individuals considered. The exponent adapts the parameter to the individual time 
lag. The model for individually varying times of observations is not a standard structural 
equation model. To define the model, constraints have to be put on the autoregression 
parameter. We show in the empirical example below how this can be done. 
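
The step from Equation 25 to Equation 26 extends to any whole number of time units. As a sketch, assuming a time-homogeneous process with one-unit parameter $\beta$ and writing $k$ for the number of units (the symbol $k$ and the summation index $j$ are notation introduced here, not taken from the chapter), repeated substitution gives:

```latex
\begin{align}
\mathit{occ}_{(t+k)i} &= \beta \cdot \mathit{occ}_{(t+k-1)i} + \mathit{up}_{(t+k)i} \\
                      &= \beta^{k} \cdot \mathit{occ}_{ti}
                         + \sum_{j=1}^{k} \beta^{k-j} \cdot \mathit{up}_{(t+j)i}
\end{align}
```

With $k = 2$ and $t = 1$ this reproduces Equation 26. Using an observed lag that is not a whole number of units in the exponent carries the same idea further; that step is a modeling assumption rather than a consequence of the substitution argument.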

If the times of observations vary individually, the growth curve structure of a model 
has to be adapted as well. In the linear growth model, the loading parameters of the 
growth factor are fixed to values that are in line with the linear growth process. If the 
occasions of measurement are equidistant, the loading parameters increase by one with 
each occasion of measurement added. This increase makes sure that the growth factor 
represents linear change. If the occasions of measurement are not equidistant, the loading 
parameters have to be adjusted accordingly. If the time lag between the second and the 
third occasion of measurement is only half the time lag between the first and the second 
occasion of measurement, then the loadings have to be fixed to the values 0, 1, and 1.5. 
If the time lags vary individually, the loading parameters have to be a function of the 
individual time lag. In this case, however, it is not the time lag between two neighbor- 
ing occasions of measurement, but the time lag between the occasion of measurement 
considered and the first occasion of measurement. Hence, the loading parameter for the 
third occasion of measurement would be a function of the individual time lag between 
the third occasion of measurement and the first one. A growth structure with individu- 
ally varying occasions of measurement is also a nonstandard structural equation model 
with specific constraints. In the program Mplus, for example, time lag variables can be 
defined as loading parameters for the growth model. We show in the following example 
how models for individually varying times of observations can be specified in the com- 
puter program Mplus. 
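
The loading rule for the linear growth factor can be illustrated with a few lines of Python. This is only a sketch of the arithmetic: the function name `growth_loadings`, the `unit` argument, and the observation times are hypothetical and are not part of the Mplus specification.

```python
def growth_loadings(times, unit):
    """Linear growth loadings: elapsed time since the first occasion, in units of `unit`."""
    first = times[0]
    return [(t - first) / unit for t in times]


# Second lag is half the first lag (2 hours, then 1 hour); the first lag is the unit.
print(growth_loadings([0.0, 2.0, 3.0], unit=2.0))

# Individually varying times: each person gets his or her own set of loadings.
# The observation times (in hours) are made up for illustration.
individual_times = {"person_a": [9.0, 11.0, 12.0], "person_b": [8.5, 10.0, 13.0]}
for person, times in individual_times.items():
    print(person, growth_loadings(times, unit=1.0))
```

The first call yields the loadings 0, 1, and 1.5 mentioned in the text; the loop shows how the loadings become person-specific once the times of observation differ between individuals.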


EMPIRICAL EXAMPLE 


In the application we presented earlier, the occasions of measurement were randomly 
selected for all individuals. Therefore, the autoregressive latent state-trait model that 
fitted the data well has to be extended by defining the autoregression parameters as func- 
tions of the time lag. Because this is a nonstandard model we explain in more detail how 
this model has to be specified. We refer to the computer program Mplus, which allows 
complex constraints.¹ To specify an autoregressive latent state-trait model with individually
varying times of observations, the variables measuring the time lag have to be defined
under the VARIABLE command with the option CONSTR: 


constr = LAG2 LAG3 LAG4 LAG5 LAG6; 

The variable LAG2 indicates the individual time lag between the first and the second
occasion of measurement, the variable LAG3 indicates the time lag between the second and the
third occasion of measurement, and so forth.

Then the model has to be specified under the MODEL command. The part that is 
constrained refers to the autoregressive structure. The parameters that have to be con- 
strained are labeled by the letter p and a number in parentheses in the line in which the 
parameter is defined: 


OCC6 on OCC5 (p5); 
OCC5 on OCC4 (p4); 
OCC4 on OCC3 (p3); 
OCC3 on OCC2 (p2); 
OCC2 on OCC1 (p1); 


For example, the line occ2 on occ1 (p1) defines the autoregression parameter for the 
second occasion-specific variable regressed on the first one (defined by on). This autore- 
gression parameter is labeled p1. To define the constraints on the five autoregression 
parameters, the MODEL CONSTRAINT command has to be added: 


MODEL CONSTRAINT: 
NEW (b); 
p1 = b**LAG2; 
p2 = b**LAG3; 
p3 = b**LAG4; 
p4 = b**LAG5; 
p5 = b**LAG6; 


The newly defined parameter b is the general autoregression parameter $\beta$. In the model
constraint command, it is now assumed that the five autoregression parameters equal 
the same general autoregression parameter b with the respective time lag variable in its 
exponent. 

The results of these analyses are depicted in Figure 21.9. The estimated general 
autoregression parameter is $\beta$ = 0.984. Because the time lag was measured in minutes,
this is the autoregression parameter for a time lag of 1 minute. Because the time lag is very
short, the autoregression parameter is close to 1. The estimated autoregression parameter
for 60 minutes is $\beta^{60}$ = 0.398; for 2 hours it is $\beta^{120}$ = 0.144. In this model, the variances of
the variables $\mathrm{UP}_t$ are smaller and the variances of the trait variables are larger than in the
model without individually varying times of observations. Apparently, intraindividual 
variability is overestimated when individually varying autoregression parameters are not 
considered. However, Mplus reports neither the standardized solution nor the explained 
variances when this type of constraint is defined. It is complicated to calculate the vari- 
ance of an occasion-specific variable explained by a preceding occasion-specific variable 
because of the variable in the exponent of the autoregression parameter. Therefore, the 
coefficients of (un)reliability, consistency, specificity, and (un)predictability cannot eas- 
ily be computed. Because the estimates in Figure 21.8 do not differ much from the esti- 
mates in Figure 21.9, the estimated variance components can be used as rough estimates. 
Another limitation is that Mplus does not report goodness-of-fit statistics for the model
with individually varying autoregression coefficients. However, given that the model with
constant autoregression coefficients across individuals fits the data well, we could expect
that this is also true for the model with individually varying autoregression parameters.

FIGURE 21.9. Estimated parameters of the latent autoregressive state-trait model with
indicator-specific trait variables for individually varying observation times for six occasions of
measurement. Reported are the unstandardized parameter estimates. $\mathrm{TRAIT}_l$, latent trait variables;
$\mathrm{OCC}_t$, latent occasion-specific variables; $\mathrm{UP}_t$, latent residual variables representing
unpredictability; $E_{lt}$, error variables; $\mathrm{LAG}_t$, time lag between $(t - 1)$ and $t$; $t$, occasion of
measurement; $l$, indicator.
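
The way a one-minute autoregression parameter translates into effects over longer lags can be checked with a short calculation. The sketch below only raises the rounded estimate of 0.984 to the power of the lag; the function name `implied_autoregression` is hypothetical, and the calculation mirrors the constraint `p = b**LAG` rather than anything Mplus reports.

```python
def implied_autoregression(beta_per_unit, lag):
    """Autoregressive effect over `lag` time units, given the effect for one unit."""
    return beta_per_unit ** lag


beta = 0.984  # rounded one-minute estimate reported in the text
for minutes in (1, 60, 120):
    print(minutes, round(implied_autoregression(beta, minutes), 3))
```

With the rounded value 0.984, the 120-minute result agrees with the reported 0.144 to three decimals, whereas the 60-minute result comes out near 0.380 rather than the reported 0.398. The text does not give enough information to say where that difference comes from.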

Because the growth model with constant loading parameters across individuals did 
not fit the data and there was no linear growth in the data, we did not specify an autore- 
gressive latent state-trait growth curve model. In general, to specify such a model with 
Mplus, the loading parameters of the growth factor have to be defined as functions of 
the individual time lag between an occasion of measurement and the first occasion of 
measurement. 


Models for More than One Day 


Models for more than one day can easily be specified by integrating the models for differ- 
ent days into one model. Figure 21.10 shows a model for 2 days, in which it is assumed 
that an autoregressive latent state-trait model within each day fits the data. Correlation 
of the latent trait variables is allowed to analyze the stability of the day-specific mood 
level. This model allows us to look at stability and variability on both the within-day 
level and the between-day level. In our application, we specified an autoregressive latent 
state-trait model with strict measurement invariance within each day and added the 
constraint that the estimated within-day parameters (loading parameters; autoregressive 
parameters; means, variances, covariances of the latent variables) do not differ between
days. This model fits the data very well ($\chi^2$ = 314.088, degrees of freedom [df] = 312, p
= .456, comparative fit index [CFI] = 0.999, root mean square error of approximation
[RMSEA] = 0.005, standardized root mean square residual [SRMR] = 0.081). In this
model, it is additionally assumed that the cross-lagged covariances of the traits are the
same. The results show that the autocorrelations of the latent trait variables between the
2 days are very high (r = .880 and .933). In addition, the cross-lagged correlations of the
latent trait variables are very high (r = .851), showing high stability of the day-specific
mood level. The within-day correlations of the two trait variables are r = .912, indicating
high homogeneity of the two test halves. The estimated autoregression parameter is
$\beta_{t(t-1)}$ = 0.343 (standardized: 0.343-0.378) for a mean time lag of about 2 hours. If one
assumes an individually varying autoregression parameter, it is estimated as $\beta$ = 0.839
and indicates the autoregression for 1 minute.

FIGURE 21.10. Latent autoregressive state-trait model with indicator-specific trait variables for
two days and six occasions of measurement for each day. $\mathrm{TRAIT}_{ld}$, latent trait variables; $\mathrm{OCC}_{td}$,
latent occasion-specific variables; $\mathrm{UP}_{td}$, latent residual variables representing unpredictability;
$E_{ltd}$, error variables; $t$, occasion of measurement; $l$, indicator; $d$, day; $\lambda_{Tltd}$, loading parameters
for the latent trait variables; $\lambda_{Oltd}$, loading parameters for the occasion-specific variables; $\beta_{t(t-1)d}$,
autoregressive parameters.

The models can be easily extended to situations with more than 2 days. In a model 
for multiple days, one of the models we have presented can be specified within each day. 
Then the latent day-specific trait variables and growth factors can be correlated over the 
different days. One can then test whether the same measurement structure, the same 
change process, and the same variability patterns hold for all days. 


Assessment of Intraindividual Variability 


The models presented are measurement models that enable researchers to measure dif- 
ferent stable and variable components of momentary states. Sometimes researchers are 
interested not only in these components but also in the degree of intraindividual variabil- 
ity over time and in interindividual differences in intraindividual variability. To assess 
intraindividual variability, different coefficients, such as the mean square successive dif- 
ference (MSSD) or the individual standard deviation (ISD), have been defined (for an 
overview, see Ebner-Priemer, Eid, Stabenow, Kleindienst, & Trull, 2009; Ebner-Priemer 
& Trull, Chapter 23, this volume). These scores are reasonable measures of variability. 
However, they do not distinguish between types of change that are typical for variability 
(fluctuations around a habitual average level) and that are typical for change as a specified 
function of time (e.g., linear change in a linear growth model). If there is linear change 
but no variability, the MSSD and ISD could be very large. The advantage of the models 
presented in this chapter is that they allow testing of hypotheses about the type of change. 
Hence, it is possible to distinguish between different types of change and to consider the 
different types of change simultaneously, as the autoregressive latent state-trait growth 
curve model has shown. Based on the structural equation model, one can define different
measures of intraindividual variability that measure exactly the type of variability
and change in which one is interested and that are not distorted by other types of change.
Consider, for example, the latent autoregressive state-trait growth curve model. In this 
model, three types of intraindividual change and variability can be estimated: 


1. The first type of change is individual linear change measured by the values of the 
latent growth factors. If one is interested in the linear change value of a single individual, 
one can use the estimated growth factor score as an individual measure of change. 


2. The second type of change is the fluctuation of the occasion-specific scores of an
individual around his or her growth line (or, if there is no linear growth, around his or
her habitual level measured by the trait). If one is interested in this type of variability, one
can estimate the individual factor scores on the latent occasion-specific variables $\mathrm{OCC}_t$
and calculate for each individual the ISD of the $\mathit{occ}_{ti}$ values over time. This score
represents the amount of intraindividual variability that is not due to linear change.


3. The third type of change is the variability of the scores on the variables $\mathrm{UP}_t$ that
represent that part of a true score that is not predictable. If one estimates the factor scores
on these variables and calculates for each individual the ISD of the $\mathit{up}_{ti}$ values over time,
one obtains individual variability scores that indicate pure variability, that is,
unpredictability. To estimate factor scores on the latent residual variables $\mathrm{UP}_t$, these variables have
to be specified as factors in the model. This may be done by including a factor for each
variable $\mathrm{UP}_t$, and fixing its loading to 1 and the residual variance of the variables $\mathrm{OCC}_t$
to 0.

From a theoretical point of view, these scores are more elaborate than coefficients 
calculated on the basis of observed scores because they are based on a psychometric 
model in which the latent variables have a clear meaning with respect to the type of 
change considered. Another advantage of calculating coefficients of intraindividual vari- 
ability on the basis of the models presented is that they can also be easily calculated in 
the case of missing data. Modern statistical procedures for handling missing data, such
as full information maximum likelihood and multiple imputation (see Black, Harel, & 
Matthews, Chapter 19, this volume), allow us to estimate factor scores for all individuals, 
even if they have missing values on some variables. 
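
The point that observed-score coefficients cannot separate linear change from fluctuation can be made concrete with a small calculation. The sketch below assumes the ISD is the sample standard deviation (denominator n − 1) and the MSSD is the mean of the squared differences between neighboring occasions; both series are hypothetical.

```python
import statistics


def isd(scores):
    """Individual standard deviation: SD of one person's scores over time (n - 1 denominator)."""
    return statistics.stdev(scores)


def mssd(scores):
    """Mean square successive difference of one person's scores over time."""
    diffs = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return sum(d * d for d in diffs) / len(diffs)


linear_only = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical: steady increase, no fluctuation
fluctuating = [3.0, 4.0, 3.0, 4.0, 3.0, 4.0]  # hypothetical: fluctuation around a stable level

for label, series in (("linear only", linear_only), ("fluctuating", fluctuating)):
    print(label, round(isd(series), 3), round(mssd(series), 3))
```

In this made-up case the purely linear series has the larger ISD and the same MSSD as the fluctuating series, although it contains no variability around its growth line. Computing the same coefficients on factor scores for the occasion-specific or residual variables, as described in the list above, avoids that confound.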


Practical Issues 


Handling of Missing Data 


Given that ambulatory assessment studies usually cover many occasions of measurement, 
and that individuals are analyzed in their natural environments, it is extremely hard, if 
not impossible, to avoid missing data. Many computer programs for structural equa- 
tion models have several options for handling missing data. Statistical methods such as 
full information maximum likelihood estimation and multiple imputation allow us to 
estimate the parameters of the model if the missing data are at least missing at random 
(Enders, 2010). Missing at random means that the missing data are not completely ran- 
dom, but that the variables explaining the missingness are known and included in the 
analysis (see Black et al., Chapter 19, this volume, for a more detailed explanation of the 
concepts). This has an important consequence for the planning of an ambulatory assess- 
ment study. Researchers have to think about the reasons for the missingness they expect 
and include variables in the study that are able to predict the missingness. For example, 
personality variables such as conscientiousness, as well as variables that assess mobile 
phone use (“I never forget my cell phone,” “I always respond to cell phone calls,” etc.), 
can be included. These variables predicting missingness then have to be included as so- 
called “auxiliary variables” in the structural equation model. The relationships between 
these variables (that are not interesting from a substantive point of view) and the other 
variables (that are of substantive interest) have to be specified in such a way that they do 
not influence the estimation of the model parameters (Enders, 2010). This can readily be 
done by the computer program Mplus using the AUXILIARY option under the VARIABLE 
command. 


Sample Size 


For an ambulatory assessment study that is meant to be analyzed with structural equation 
modeling, one important question is how large the sample size has to be. There is no clear 
rule, and much more research is necessary to answer this question. There are two ways to 
solve this problem: (1) an a priori power analysis and (2) Monte Carlo simulation studies. 
In an a priori power analysis, one wants to find out how large the sample size has to be to 
detect a proposed effect with a specific power on a specific level of significance (Bolger, 
Stadler, & Laurenceau, Chapter 16, this volume). In this case, the model has to be specified 
beforehand (see Hancock, 2006, for details). This requires that the number of occasions of 
measurement is given. However, increasing and decreasing the size of the model by add- 
ing and dropping occasions of measurement allow analysis of the effect of the number of 
occasions of measurement on the power of the test and the sample size needed. 

If one wants to find out how large the sample size has to be to make sure that the 
estimation methods and goodness-of-fit coefficients work fine, Monte Carlo simulation 
studies can be conducted. By systematically varying the sample size and the number 
of occasions of measurement required, one can determine the minimal conditions with
respect to the sample size and the occasions of measurement needed. Bandalos (2006), as
well as Muthén and Muthén (2002), describe in detail how this can be done.
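
The logic of such a simulation study can be sketched in a few lines. The example below is a deliberately stripped-down stand-in and not the procedure described in the cited sources: it simulates an observed first-order autoregressive process without latent variables or measurement error, and it estimates the autoregression parameter by pooled least squares instead of fitting a structural equation model. All function names, the population value of 0.4, and the design values are assumptions made for illustration.

```python
import random
import statistics


def simulate_estimate(n_persons, n_occasions, beta, rng):
    """Simulate one dataset and return the pooled least-squares estimate of beta."""
    sd_up = (1.0 - beta ** 2) ** 0.5  # keeps the variance of occ at 1 on every occasion
    sum_xy = 0.0
    sum_xx = 0.0
    for _ in range(n_persons):
        prev = rng.gauss(0.0, 1.0)
        for _ in range(n_occasions - 1):
            cur = beta * prev + rng.gauss(0.0, sd_up)
            sum_xy += prev * cur  # regression through the origin: all means are 0 by design
            sum_xx += prev * prev
            prev = cur
    return sum_xy / sum_xx


def monte_carlo(n_persons, n_occasions, beta=0.4, replications=200, seed=1):
    """Mean and standard deviation of the estimates across replications."""
    rng = random.Random(seed)
    estimates = [
        simulate_estimate(n_persons, n_occasions, beta, rng) for _ in range(replications)
    ]
    return statistics.mean(estimates), statistics.stdev(estimates)


for n in (25, 100, 400):
    mean_est, sd_est = monte_carlo(n_persons=n, n_occasions=6)
    print(n, round(mean_est, 3), round(sd_est, 3))
```

Varying `n_persons` and `n_occasions` in the loop shows how the spread of the estimates shrinks as the design grows. A study of the models in this chapter would replace the least-squares step with the actual model estimation and would also track the behavior of the fit statistics.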


Conclusion 


Structural equation modeling is a versatile methodological tool for ambulatory assess- 
ment studies for different reasons. First, it allows separation of unsystematic measure- 
ment error from systematic variability and measurement of latent variables representing 
different sources of variability and change. Second, it makes it possible to test specific 
hypotheses about the type of change. Third, it allows assessment of different scores of 
intraindividual variability even in the case of missing values. Fourth, it appropriately 
considers individually varying times of observations. 


Note 


1. We thank Claudia Crayen for explaining to us how this constraint can be specified in Mplus. 


CHAPTER 22 


Analyzing Diary 
and Intensive Longitudinal Data 
from Dyads 


JEAN-PHILIPPE LAURENCEAU 
NIALL BOLGER 


Thus far, the chapters in this handbook have focused almost exclusively on methods
for studying the daily life experiences of individuals. For most individuals, however, 
daily life means being embedded in a network of interpersonal relationships. In fact, it 
has been argued that the most important contexts for human beings are those involv- 
ing close relationships (Reis, Collins, & Berscheid, 2000). Among close relationships, 
many of the important ones are dyadic, with examples including marital partners, the 
mother-child and the father-child relationship, dating relationships, and student-teacher 
and supervisor-supervisee relationships. Any investigation of everyday human experi- 
ence would do well to examine these powerful influences on daily behavior, thoughts, 
and feelings. 

As we hope will become clear by the end of this chapter, intensive longitudinal data
from dyads allow investigators to ask important (and possibly new) questions that cannot
be addressed in corresponding studies of independent individuals. The challenges of 
analyzing intensive longitudinal data from dyads come from two sources: (1) adequately 
conceptualizing the dyadic causal influence process under investigation and (2) specifying 
any residual statistical dependencies that occur in data obtained from two individuals 
who share daily life together. 

Our purpose in this chapter is to provide an introduction to analyzing intensive lon- 
gitudinal data that come from both members of a dyad. We illustrate our approach using 
an example daily diary dataset on social support transactions and relationship intimacy 
collected from both members of married couples in which the wife had breast cancer. 
Although we intend this chapter to be a self-contained primer, interested readers can 
find more detail on our approach to analyzing intensive longitudinal data from dyads in 
Bolger and Laurenceau (in press) and Laurenceau and Bolger (2005). Coverage on dyadic 
data analysis more generally can be found in Kenny, Kashy, and Cook (2006). 


Methodological and Analytic Issues with Dyadic Data 


From our perspective, what makes intensive longitudinal data from dyads such as spouses, 
friends, roommates, siblings, and parents interesting are the various dyadic processes that 
create statistical dependencies in the data. By dyadic processes we mean phenomena such 
as selection into the relationship as a function of similarity (e.g., in political beliefs) or 
complementarity (e.g., in dominance vs. submissiveness); direct causal influence by partners
on one another’s thoughts, feelings, and behaviors, a process originally described
by Kelley et al. (1983) and usually known as interdependence (e.g., marital partners
argue with one another); and shared experiences that create similarity in dyad partners’ 
thoughts, feelings, and behavior (e.g., friends are part of the same cultural milieu and 
develop similar tastes in music). For the most part, these processes lead dyad members to 
be more similar to one another than two randomly chosen people from the population. In 
the case of the dyadic process of complementarity, however, it is possible for dyad mem- 
bers to be more different than two randomly selected people (e.g., dyad partners may be 
more different on dominance vs. submissiveness than are randomly chosen people who 
are not in the same dyad). 

Another way of thinking about this is to say that once we know the score of one dyad 
member on some variable (e.g., one marital partner’s level of satisfaction), it is possible to 
predict with above chance success the other dyad member’s score. In statistical terms the 
scores from dyad members are said to be nonindependent. If there are N dyads, doing 
an analysis that treats the data as coming from 2*N individuals typically violates the 
assumption of independent observations that is made in traditional analysis of variance 
(ANOVA)/regression models. Although nonindependence is an inherent phenomenon in 
most types of dyadic data, and explaining it is of interest, it is important to take noninde- 
pendence into account even when it is not the focus of interest because it biases inferential 
tests by typically increasing Type I errors (Kenny et al., 2006). 

Nonindependence can be quantified using the Pearson product-moment correlation 
coefficient for distinguishable dyads (e.g., husbands and wives, brother and sister, parent 
and child) or the intraclass correlation for nondistinguishable dyads (e.g., homosexual 
partners, same-sex siblings, same-sex roommates). Distinguishability is often determined 
conceptually by identifying some important characteristic (e.g., gender, age) that may 
differentiate partners. Moreover, there are also empirical tests of distinguishability that 
involve comparison of the fit of a model in which means, variances, and/or covariances 
for a set of variables are constrained to be equal across partners within a dyad to a model 
in which they are free to vary (Ackerman, Donnellan, & Kashy, 2011). Because of space 
limitations, we focus on distinguishable dyadic data in this chapter, but we refer inter- 
ested readers to other sources (Kashy, Donnellan, Burt, & McGue, 2008; Kenny et al., 
2006) for corresponding approaches to analyzing nondistinguishable dyadic data. 
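
The two indices of nonindependence can be compared on a small made-up dataset. The sketch below uses the ordinary product-moment formula and a one-way analysis-of-variance form of the intraclass correlation for groups of size two; whether this is the exact estimator intended in the sources cited above is an assumption, and the scores are hypothetical.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between the two members' scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def dyad_icc(x, y):
    """One-way ANOVA intraclass correlation for dyads (two members per dyad)."""
    n = len(x)
    grand_mean = (sum(x) + sum(y)) / (2 * n)
    ms_between = 2 * sum(((a + b) / 2 - grand_mean) ** 2 for a, b in zip(x, y)) / (n - 1)
    ms_within = sum((a - b) ** 2 for a, b in zip(x, y)) / (2 * n)
    return (ms_between - ms_within) / (ms_between + ms_within)


# Hypothetical satisfaction scores; member_2 is always one point higher than member_1.
member_1 = [5.0, 6.0, 7.0, 8.0]
member_2 = [6.0, 7.0, 8.0, 9.0]

print(round(pearson_r(member_1, member_2), 3))
print(round(dyad_icc(member_1, member_2), 3))
```

The Pearson correlation ignores the constant mean difference between the two roles, which is appropriate when the members are distinguishable. The intraclass correlation treats the members as interchangeable, so the same difference counts as within-dyad disagreement and lowers the coefficient.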

We note that one way of dealing with the dependencies that typically come with 
dyadic data is to analyze data from each partner separately in two models. Doing so, however,
results in a loss of information about the very 
interdependencies that make dyadic diary data unique. Another approach that should 
largely be avoided as a way of getting around interdependence is creating a composite 
“dyad” score by taking the average of the two partner scores. An average dyad score can 
cover up effects that may be very different for each partner and should be used only after 
careful consideration (Kenny et al., 2006). 

A particular framework for explaining causal interdependence in dyadic data that
has received considerable attention is the actor-partner interdependence model (APIM; 
Kenny, 1996; Kenny et al., 2006). At the heart of this model is the specification of actor 
and partner effects. An actor effect is one in which an individual’s predictor has an 
influence on the individual’s outcome (i.e., an intrapersonal effect). A partner effect is 
one in which an individual’s predictor has an influence on the partner’s outcome (i.e., 
an interpersonal effect). Moreover, actor and partner effects are typically interpreted 
with both effects in the model simultaneously. Figure 22.1 illustrates an APIM based on 
a cross-sectional dyadic dataset involving intimacy scores as outcomes and measures of 
providing emotional support to one’s partner (support provision) as predictors for a set 
of female breast cancer patients and their male spouses. While it is hypothesized that
a spouse’s report of providing emotional support to his partner (i.e., the breast cancer
patient) will be associated with a shift in his own rating of intimacy (actor effect $a_S$), it is
also expected that spousal provision of emotional support will be associated with a shift
in his partner’s rating of intimacy (partner effect $p_S$). The corresponding paths for patient
actor ($a_P$) and partner ($p_P$) effects are also depicted. The illustrative example detailed in
this chapter extends this model to the intensive longitudinal context.
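
Written as a pair of regression equations, the cross-sectional model described in this paragraph might look as follows. This is a sketch in notation introduced here: the subscripts $P$ and $S$ stand for patient and spouse, the intercepts $c$ and residuals $e$ are not named in the text, and each partner effect is indexed by the person whose support provision is the predictor.

```latex
\begin{align}
\mathit{Intimacy}_{P} &= c_{P} + a_{P} \cdot \mathit{Support}_{P}
                         + p_{S} \cdot \mathit{Support}_{S} + e_{P} \\
\mathit{Intimacy}_{S} &= c_{S} + a_{S} \cdot \mathit{Support}_{S}
                         + p_{P} \cdot \mathit{Support}_{P} + e_{S}
\end{align}
```

Both predictors appear in both equations, which is what allows actor and partner effects to be interpreted with the other held constant. In applications of this kind of model the residuals $e_{P}$ and $e_{S}$ are typically allowed to correlate, reflecting nonindependence that the predictors do not explain.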

Finally, several types of designs can be used to collect dyadic data. For example, each 
individual in a group of sorority sisters can be asked to rate every other individual on a set 
of characteristics using a round-robin design (Kenny, 1994), producing a sort of dyadic 
data. However, the focus of this chapter is on the standard dyadic design (Kenny et al., 
2006), in which dyads are defined as one partner being linked to one and only one other 
partner in a sample, and both partners are assessed on the same variables. Specifically, 
we focus on the standard dyadic design integrated with a daily diary study of distinguish- 
able couples. 


Couples Coping with Breast Cancer: An Illustrative Example 


FIGURE 22.1. Depiction of cross-sectional actor-partner interdependence model (patient and
spouse intimacy and support provision).
[Path diagram: Patient Support Provision and Spouse Support Provision each predict Patient
Intimacy and Spouse Intimacy. a = Actor effect; p = Partner effect.]

We now describe an illustrative example dataset that is the basis for the analytical examples
in this chapter. We draw motivation for this example from the literature on emotional
support and coping in couples facing adversity (for a review, see Revenson, Kayser,
& Bodenmann, 2005). In the context of individuals confronting a major stressor, receiving
support from one’s spouse (support receipt) has, somewhat counterintuitively, been 
associated with poorer individual well-being outcomes, while a spouse’s report of provid- 
ing support (support provision) has been associated with better individual well-being out- 
comes (Bolger, Zuckerman, & Kessler, 2000). However, support receipt has been found 
simultaneously to enhance feelings of relationship closeness (Gleason, Iida, Shrout, & 
Bolger, 2008). It is less clear whether the effects of support provision boost feelings of 
relationship intimacy, and whether spouses of individuals confronting stress also benefit 
from support provision. Moreover, more direct sampling of the daily experiences of both 
the individual facing the stressor and his or her partner could clarify beneficial supportive 
processes in the everyday lives of couples confronting a health-related stressor (Hage- 
doorn, Sanderman, Bolks, Tuinstra, & Coyne, 2008). 

This example focuses on examining the influence of daily support provision on 
daily relationship intimacy in breast cancer patients and their nonpatient spouses. The 
data, obtained shortly after surgery, comprise 75 women with early stage breast cancer 
and their spouses, who independently completed an electronic diary assessing support 
provision and relationship intimacy for 10 consecutive evenings. These data were based 
on parameter estimates from modeling variables in a published dataset of breast cancer 
patients and their spouses who consented to participation as approved by the Institutional 
Review Board of a local cancer center (Belcher et al., in press). Prospective participants 
were identified from their medical charts and the cancer center’s tumor registry, focusing 
on breast cancer as the primary diagnosis. Women were eligible if they had been diag- 
nosed with early stage breast cancer (i.e., Stage 1, 2, 3a, or ductal carcinoma in situ), had 
undergone breast cancer surgery within the prior 2 months, were married or cohabitating 
with a significant other, spoke and understood English, and had access to the Internet at 
night. Eligible participants were contacted by phone or during a regular outpatient clinic 
visit, during which a research assistant described the study. Participants completed daily 
diary measures for 10 consecutive nights via the Internet. The diary took approximately 
6-7 minutes to complete each night. Participant compliance with the diary procedure was 
monitored using a time-date stamp; thus, it could be determined whether participants 
completed their diaries as instructed (i.e., during the evenings). 

Participants completed items concerning daily relationship intimacy (Laurenceau, 
Feldman Barrett, & Pietromonaco, 1998; Laurenceau, Feldman Barrett, & Rovine, 
2005), and the composite formed from these items ranged from 0 to 8. Participants also reported 
on whether they provided emotional support to their partners. Derived from Bolger and 
colleagues (2000), this measure posed the following to participants: “Did you provide 
any help to your partner (e.g., listening, comforting, and reassurance) for a worry, prob- 
lem, or difficulty your partner had in the past 24 hours?” Responses were coded such that 
“1” represented days on which emotional support was provided and “0” represented days 
on which support was not provided. Figure 22.2 depicts example intimacy and emotional 
support provision scores on particular days from both partners within couples. 

We hypothesized that on days when patients reported providing emotional support 
to their spouses, patients would feel more relationship intimacy than on days when they 
did not provide support (Figure 22.3; Path 1), reflecting an actor effect. A correspond- 
ing actor effect was also predicted for spouses (Figure 22.3; Path 2). We expected that 
on days when support provision was reported by spouses, patients would feel greater 
relationship intimacy (a partner effect), and that this effect would be independent of the 
patients’ provision of support (Figure 22.3; Path 3). These same unique benefits of support 
provision were also predicted to be significant for spouses (Figure 22.3; Path 4). 
Other details of Figure 22.3 are explained later in the chapter. 


[Figure 22.2 appears here: a table of example rows with the columns couple, day0, pintim, sintim, pemosup, and semosup.] 

FIGURE 22.2. Data structure for patient (p) and spouse (s) daily intimacy and emotional support 
example. 


A Multilevel Model for Dyadic Diary Data 


Dyadic data acquired via a method for studying daily life present several data-analytic 
challenges stemming from various sources of nonindependence in these data. Traditional 
statistical methods (e.g., ANOVA, regression) typically focus on observations that are 
independently sampled, whereas dyadic diary data are inherently nonindependent. Not 
only is there nonindependence between members of the dyad, but there is also noninde- 
pendence of observations within each dyad member. Thus, the types of dyadic data in 
which we are interested have three levels of analysis—the dyad, the persons within the 
dyad, and the observations within the persons. Nevertheless, for the types of intensive 
longitudinal data with which we typically work, as exemplified by the example dataset, 
the within-person observations are crossed rather than nested. That is to say, the first day 
of the diary study for the patient is the same day as the first day for the spouse. This is an 
important distinction because it allows us to examine day-specific sources of dependency 
by allowing the patient’s residual on a given day to be correlated with the spouse’s residual 
on that day. This would not be possible if the daily observations were merely nested 
and bore no relationship to one another across the dyad members, as would be the case 
in a multilevel model of repeated observations nested within partners nested within dyads. 


[Figure 22.3 appears here: a path diagram. At Level 2 (between-couple): Patient Daily Support Provision (between) and Spouse Daily Support Provision (between). At Level 1 (within-couple): Patient Daily Support Provision (within), Spouse Daily Support Provision (within), Patient Daily Intimacy, and Spouse Daily Intimacy.] 

FIGURE 22.3. Depiction of multilevel model for dyadic diary data, with actor effects, partner 
effects, and filled circles representing random effects. 

Multilevel modeling provides a conceptual framework and a flexible set of ana- 
lytic tools to handle this type of data structure. Typical applications of multilevel mod- 
eling focus on modeling longitudinal data in which intensively measured (e.g., daily) 
time series data are clustered within persons, and this was covered in the discussion of 
multilevel modeling by Nezlek (Chapter 20, this volume). To accommodate the dyadic 
nature of these repeated measures data, the multilevel model is expanded to accommo- 
date correlated outcomes. We consider the formulation of a multilevel model that lies at 
the intersection of diary (intensive longitudinal) data analysis and dyadic data analysis: 
the multilevel model for dyadic diary data. 

There are two somewhat different data setups and corresponding analytic approaches 
for implementing the multilevel model for dyadic diary data. The first is based on an 
approach that analyzes the three levels of distinguishable dyadic diary data described 
earlier as two levels in which the lowest level represents multivariate repeated measures 
(i.e., a set of diary days for patient and a corresponding set for spouse). This kind of 
multilevel model was first described by Raudenbush, Brennan, and Barnett (1995) as a 
multivariate hierarchical linear model, and later by MacCallum, Kim, Malarkey, and 
Kiecolt-Glaser (1997) as a multivariate change model. It also draws on work by Kenny 
and Zautra (1995) and Gonzalez and Griffin (2000). Key aspects of the data structure for 
this first approach to multilevel modeling of dyadic diary data are the stacking of patient 
and spousal outcomes and predictors one on top of the other, and a nonconventional use 
of dummy-codes to select the male predictors of the male part of the stacked outcome 
and the female predictors of the female part of the outcome. As originally described by 
Raudenbush et al. (1995), the data restructuring and dummy-codes are needed to “trick” 
multilevel programs such as HLM and SAS PROC MIXED to model two outcomes simul- 
taneously because these programs are structured for single repeated measures outcomes. 
A more detailed description of the necessary stacked data structure, corresponding mul- 
tilevel model, and interpretation of illustrative results for this approach can be found in 
Laurenceau and Bolger (2005). 

For the remainder of this chapter, we focus on a second approach to implementing 
the multilevel model for dyadic diary data. This approach in some ways is more flexible 
than the first, in that it makes use of a general latent variable modeling framework in 
which multilevel random effects are conceptualized as latent variables (Muthén & Asp- 
arouhov, 2010). Moreover, the restructuring (i.e., stacking) of the data described earlier 
to “trick” multilevel programs to model two repeated measures outcomes simultane- 
ously is not necessary so long as the predictors and outcomes of interest are represented 
as patient variables and partner variables in a person-period (a.k.a., univariate or long 
format) dataset. We demonstrate this second approach using Mplus software (Muthén & 
Muthén, 1998-2010). We realize that many readers may not be familiar with Mplus for 
multilevel modeling, and that the Mplus syntax we provide later in the chapter may be 
intimidating and require some getting used to. Nevertheless, we believe the gain in flex- 
ibility can be worth it. We also think it worth noting that both approaches to analyzing 
dyadic repeated measures data produce results with the same parameter estimates and 
standard errors (within rounding error) when running the same model. 

Figure 22.2 shows the necessary data structure for the first 22 rows of data. Couple 
is an ID variable unique to each couple in the dataset; day0 is an index of each of the 10 
days of assessment (0-9); pintim is the daily intimacy score for patients; sintim is the 
daily intimacy score for spouses; pemosup is the binary daily report of emotional support 
provision from patients; and semosup is the binary daily report of emotional support pro- 
vision from spouses. Because of this data structure, the terms within-person and within- 
couple are used interchangeably, as are the terms between-couple and between-person. 
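
The layout just described can be sketched in a few lines of Python. This is only an illustration: the field names follow the variable names given above, the values are made up (they are not the study's data), and the helper `is_one_row_per_couple_day` is a hypothetical name introduced here. The point it shows is that each couple contributes exactly one row per diary day, with the patient's and the spouse's variables side by side on that row.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CoupleDay:
    """One row of the dataset: one couple on one diary day."""
    couple: int     # couple ID
    day0: int       # diary day index, 0-9
    pintim: float   # patient's daily intimacy composite (0-8)
    sintim: float   # spouse's daily intimacy composite (0-8)
    pemosup: int    # 1 if the patient reported providing support, else 0
    semosup: int    # 1 if the spouse reported providing support, else 0


def is_one_row_per_couple_day(rows):
    """Return True if no (couple, day0) pair occurs more than once."""
    counts = Counter((r.couple, r.day0) for r in rows)
    return all(n == 1 for n in counts.values())


# Made-up values, for illustration only.
rows = [
    CoupleDay(1, 0, 3.5, 4.5, 1, 0),
    CoupleDay(1, 1, 5.0, 6.0, 0, 1),
    CoupleDay(2, 0, 2.5, 3.0, 0, 0),
    CoupleDay(2, 1, 4.0, 4.0, 1, 1),
]
print(is_one_row_per_couple_day(rows))
```

Because both partners share a row for a given day, the day-specific link between the patient's and the spouse's scores is preserved in the data file itself.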

The simultaneously estimated pair of level-1 equations representing the within- 
couple multivariate multilevel model for the regression of intimacy on support provision 
is as follows: 


$$PINTIM_{ti} = \beta_{0ip} + \beta_{1ip}(PEMOSUP_{ti}) + \beta_{2ip}(SEMOSUP_{ti}) + e_{tip}$$
$$SINTIM_{ti} = \beta_{0is} + \beta_{1is}(SEMOSUP_{ti}) + \beta_{2is}(PEMOSUP_{ti}) + e_{tis}$$


$PINTIM_{ti}$ and $SINTIM_{ti}$ are the daily patient and spouse intimacy scores, respectively, 
where $t$ indexes diary day and $i$ indexes couple; $\beta_{0ip}$ and $\beta_{0is}$ are separate individual patient 
and spouse model intercepts, respectively; $PEMOSUP_{ti}$ is an individual patient’s report 
of providing emotional support to the spouse on a particular day (coded 1 if the patient 
reported providing support to the spouse and 0 if she did not); $SEMOSUP_{ti}$ is an individual 
spouse’s report of providing emotional support to the patient on a particular day 
(coded 1 if the spouse reported providing support to the patient and 0 if he did not); $\beta_{1ip}$ 
and $\beta_{1is}$ are individual patient and spouse coefficients representing the patient and spouse 
actor effects for emotional support provision, respectively, on relationship intimacy; $\beta_{2ip}$ 
and $\beta_{2is}$ are individual patient and spouse coefficients representing the patient and spouse 
partner effects for support provision, respectively, on relationship intimacy; and $e_{tip}$ and 
$e_{tis}$ are individual patient and spouse error components that are allowed to correlate to 
capture daily residual covariation in patient and spouse intimacy. Although not depicted 
in this model, one can also account for the presence of autocorrelated errors across daily 
assessments (Bolger & Laurenceau, in press; Bolger & Shrout, 2007). It should be noted 
that modeling autocorrelated errors or similar error structures (e.g., Toeplitz) 
for this multilevel model in Mplus software is impractical at this point in time. However, 
to the extent that any autocorrelation in residuals is due to omitted time trends, one can 
account for this by adding elapsed time as an additional within-person predictor of daily 
intimacy (for more discussion of this issue, see Bolger & Laurenceau, in press). 

Finally, random effects were estimated for the pairs of intercepts, actor effects, and 
partner effects for both patients and spouses: 


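$$\beta_{0ip} = \gamma_{00p} + u_{0ip} \qquad \beta_{0is} = \gamma_{00s} + u_{0is}$$
$$\beta_{1ip} = \gamma_{10p} + u_{1ip} \qquad \beta_{1is} = \gamma_{10s} + u_{1is}$$
$$\beta_{2ip} = \gamma_{20p} + u_{2ip} \qquad \beta_{2is} = \gamma_{20s} + u_{2is}$$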


The corresponding $\gamma$ parameters represent the fixed effects. All equations above make 
the following assumptions about the distribution of error terms: The random effects ($u_{0ip}$, 
$u_{1ip}$, $u_{2ip}$, $u_{0is}$, $u_{1is}$, and $u_{2is}$) are assumed to have a normal distribution with a mean of 
0, effect-specific variances, and covariances; the residuals $e_{tip}$ and $e_{tis}$ have a 
distribution with mean 0 and variances $\sigma^2_p$ and $\sigma^2_s$, respectively. We performed these 
multilevel analyses using Mplus 6 (Muthén & Muthén, 1998-2010), and the syntax cor- 
responding to the final model is provided in Appendix 22.1. 

Before we review the Mplus syntax for specifying this model, we take an important 
departure involving the separation of within-person and between-person effects in inten- 
sive longitudinal predictors. As is the case for the outcome daily intimacy, note that the 
predictor emotional support provision varies both within-persons and between-persons 
(i.e., each patient’s support provision fluctuates from day to day around an average typical 
level of support provision that itself varies from patient to patient). We wish to decompose 
the influence of emotional support on intimacy into two effects: one reflecting the within- 
person association between intimacy and emotional support, and the other reflecting the 
between-persons association between intimacy and emotional support. Therefore, we split 
the original predictor variable (pemosup) into two predictor variables: a stable between- 
persons mean for each person across their diary days (pemosupb) and the deviations from 
this mean representing within-person change (pemosupw). Of particular interest is the 
within-person slope of daily intimacy regressed on pemosupw. To test the between-persons 
effect of support provision as well, we add each patient’s mean support provision (pemosupb) 
to the between-person part of the model. Therefore, between-person variability in 
patient and spouse support provision is accounted for by including in the model the person 
means of these within-person predictors from both partners (Raudenbush & Bryk, 2002). 
Separating within-person change in support from between-person differences in support 
in this way allows the effects of patient and spouse daily support provision (i.e., pemosupw 
and semosupw) to be interpreted as “pure” within-person effects. 
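
The decomposition described above can be sketched in plain Python as a minimal illustration. The function name `split_within_between`, the dictionary-based records, and the data values are assumptions made for this sketch; only the suffixes `b` and `w` follow the naming used in the text. The sketch computes each couple's mean of the predictor (the between-person part) and each day's deviation from that mean (the within-person part).

```python
from collections import defaultdict


def split_within_between(records, cluster_key, var):
    """Add person-mean (var + 'b') and mean-deviated (var + 'w') columns."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec[cluster_key]] += rec[var]
        counts[rec[cluster_key]] += 1
    means = {c: sums[c] / counts[c] for c in sums}

    out = []
    for rec in records:
        between = means[rec[cluster_key]]
        # Copy the record and attach both components.
        out.append({**rec, var + "b": between, var + "w": rec[var] - between})
    return out


# Made-up daily support reports for two couples (four days each).
records = [
    {"couple": 1, "day0": d, "pemosup": v} for d, v in enumerate([1, 0, 1, 1])
] + [
    {"couple": 2, "day0": d, "pemosup": v} for d, v in enumerate([0, 0, 1, 0])
]

for row in split_within_between(records, "couple", "pemosup"):
    print(row)
```

Within each couple the deviations sum to zero, so the within-person component carries no information about between-person differences in average support. The chapter's model additionally grand-mean centers the person means, a step this sketch leaves out.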

Below is the Mplus syntax we use to estimate this multilevel model for dyadic diary 
data, where capitalized terms refer to Mplus-specific commands and options, and lower- 
case terms are user-defined variables: 


TITLE:     Breast cancer couples and daily support example; 
DATA:      FILE IS chapter22example.dat; 
DEFINE:    pemosupb = CLUSTER_MEAN (pemosupw); 
           semosupb = CLUSTER_MEAN (semosupw); 
VARIABLE:  NAMES ARE pintim sintim day0 pemosupw semosupw couple; 
           USEVAR = couple day0 pintim sintim pemosupw semosupw 
                    pemosupb semosupb; 
           BETWEEN = pemosupb semosupb; 
           WITHIN = day0 pemosupw semosupw; 
           CLUSTER = couple; 
           CENTERING is GROUPMEAN (pemosupw semosupw day0); 
           CENTERING is GRANDMEAN (pemosupb semosupb); 
ANALYSIS:  TYPE = twolevel random; 
MODEL:     %WITHIN% 
           pactorw | pintim ON pemosupw; 
           ppartw | sintim ON pemosupw; 
           sactorw | sintim ON semosupw; 
           spartw | pintim ON semosupw; 
           pintim sintim ON day0; 
           pintim WITH sintim; 
           %BETWEEN% 
           pintim ON pemosupb semosupb; 
           sintim ON pemosupb semosupb; 
           pintim WITH sintim pactorw sactorw ppartw spartw; 
           sintim WITH pactorw sactorw ppartw spartw; 
           pactorw WITH sactorw ppartw spartw; 
           sactorw WITH ppartw spartw; 
           ppartw WITH spartw; 
           ![ppartw] (1); [spartw] (1); 
OUTPUT:    cinterval; 


After the TITLE and DATA specifications, a set of DEFINE statements is used to cre- 
ate new variables that are the cluster (couple/person) means for patient and spouse daily 
support provision, pemosupb and semosupb. These new variables are used to account 
for between-couple differences in patient and spouse support provision. The VARIABLE 
command lists off the names of the variables in the dataset that is read in by Mplus. The 
USEVAR statement lists off the variables that are actually used in the analysis (note that 
the two new person-mean variables we created are included at the end of this list). The 
BETWEEN statement contains the predictor variables in the analysis that vary only between 
persons. The WITHIN statement contains the predictor variables in the analysis that vary 
only within persons. The grouping or clustering variable is specified in the CLUSTER state- 
ment. The CENTERING statements are used to GRANDMEAN center the between-person vari- 
ables and GROUPMEAN center the two patient and spouse daily support provision predic- 
tors to create person mean-deviated versions of these predictors. 

The TYPE = TWOLEVEL RANDOM option on the ANALYSIS command tells Mplus that this is a 
two-level multilevel model with random slopes. The first four lines of syntax in the MODEL 
command under the %WITHIN% specification make use of the “|” symbol, which tells Mplus 
to create random intercepts and slopes for the within-person regression relationship that 
follows it. In the first of these lines, pactorw is a label that is given to the random slope 
of the regression of pintim ON pemosupw, which is the patient’s within-person actor 
effect of emotional support provision on daily intimacy. Using the Mplus syntax logic, the 
random slope effect is called pactorw and the random intercept is called pintim. The line 
pintim sintim ON day0 is used to control for any possible linear time trends in the intimacy 
outcomes over days. The final line under %WITHIN%, pintim WITH sintim, is what 
allows for a correlation between patient and spouse within-person error components, 
which captures daily residual covariation in patient and spouse intimacy outcomes. 

The lines under the %BETWEEN% statement specify the between-persons part of the 
multilevel model. The first two of these lines use the person means of patient and spouse 
support provision as predictors to control for between-person differences in patient and 
spouse support. The following set of statements containing WITH specify the various 
covariances among the six between-person random effects: the two random intercepts 
for patient and spouse, the random actor effects of support provision for patients and 
spouses, and the random partner effects of support provision for patients and spouses. 
The penultimate line of syntax begins with a “!,” which is the symbol Mplus uses to 
comment out code that will not be processed by the program. This line of code will be 
explained a little bit later. Finally, the OUTPUT statement requests confidence intervals 
for all effects in the output. 

An excerpted output for this run is provided here and corresponds to the effects 
reported in Table 22.1. For the sake of conciseness in Table 22.1, we did not report the 
nonsignificant effects of day0 nor all of the covariances among the between-person ran- 
dom effects. However, all effects in Table 22.1 can be found in the excerpted sections of 
output that follow: 


MODEL RESULTS 

                                                    Two-Tailed 
                    Estimate       S.E.  Est./S.E.    P-Value 

Within Level 

 PINTIM     ON 
    DAY0              -0.005      0.013     -0.391      0.696 

 SINTIM     ON 
    DAY0              -0.020      0.013     -1.517      0.129 

 PINTIM   WITH 
    SINTIM             0.201      0.050      4.003      0.000 

 Residual Variances 
    PINTIM             0.933      0.052     17.895      0.000 
    SINTIM             0.944      0.067     14.093      0.000 


The within-person results indicate that there are no significant effects for day0, indicating 
no linear time trends for either patient or spouse daily intimacy. As might be 
expected, there is significant positive covariation between patient and spouse within-person 
residuals. This indicates that on days when a patient’s intimacy is higher than 
her average, her spouse’s intimacy also tends to be higher than his average. Finally, the 
estimates of patient and spouse within-person residual variances are provided. 


TABLE 22.1. Estimates from a Dyadic Multilevel Model of Daily Intimacy 
as a Function of Support Provision 

Outcome: Daily relationship intimacy 

                                                                                   95% CI 
Model parameters                                 Estimate (SE)       z   p-value(a)   Lower   Upper 

Fixed effects (intercept, slopes) 
  Patient Intercept, $\gamma_{00p}$                 4.11 (0.09)   43.67    <.001     3.93    4.29 
  Spouse Intercept, $\gamma_{00s}$                  3.82 (0.09)   40.39    <.001     3.63    4.00 
  P Support Provision (Actor), $\gamma_{10p}$       0.45 (0.10)    4.45    <.001     0.25    0.65 
  S Support Provision (Actor), $\gamma_{10s}$       0.45 (0.10)    4.47    <.001     0.26    0.65 
  P Support Provision (Partner), $\gamma_{20p}$     0.15 (0.09)    1.61     .110    -0.03    0.32 
  S Support Provision (Partner), $\gamma_{20s}$     0.35 (0.10)    3.62    <.001     0.16    0.53 
  P Mean Provision → P Intimacy                     1.43 (0.65)    2.19     .028     0.15    2.71 
  S Mean Provision → P Intimacy                     0.12 (0.64)    0.18     .857    -1.14    1.37 
  P Mean Provision → S Intimacy                    -0.90 (0.56)   -1.62     .110    -1.99    0.19 
  S Mean Provision → S Intimacy                     0.12 (0.62)    0.19     .847    -1.09    1.34 

Random effects ((co)variances) 
 Level-2 (between-couple) 
  Patient Intercept variance                        0.57 (0.11)    5.25    <.001     0.36    0.78 
  Spouse Intercept variance                         0.58 (0.09)    6.33    <.001     0.40    0.75 
  P-S Intercept covariance                          0.31 (0.09)    3.53     .001     0.14    0.47 
  P Actor Effect variance                           0.32 (0.12)    2.63     .008     0.08    0.56 
  S Actor Effect variance                           0.25 (0.13)    2.02     .045     0.01    0.50 
  P-S Actor Effect covariance                       0.15 (0.11)    1.34     .180    -0.07    0.38 
  P Partner Effect variance                         0.17 (0.11)    1.57     .117    -0.04    0.38 
  S Partner Effect variance                         0.19 (0.10)    1.92     .055     0.00    0.39 
  P-S Partner Effect covariance                     0.13 (0.09)    1.45     .146    -0.04    0.29 
 Level-1 (within-couple) 
  Patient Residual                                  0.93 (0.05)   17.90    <.001     0.83    1.04 
  Spouse Residual                                   0.94 (0.07)   14.09    <.001     0.81    1.08 
  P-S Residual covariance                           0.20 (0.05)    4.00    <.001     0.10    0.30 

Note. N = 75 couples; 750 days. P, patient; S, spouse. 
(a) All p values are two-tailed except in the case of variances, where one-tailed p-values are used (because variances are 
constrained to be non-negative). 


                                                    Two-Tailed 
                    Estimate       S.E.  Est./S.E.    P-Value 

Between Level 

 PINTIM     ON 
    PEMOSUPB           1.430      0.652      2.193       .028 
    SEMOSUPB           0.116      0.640      0.181       .857 

 SINTIM     ON 
    PEMOSUPB          -0.900      0.557     -1.616       .106 
    SEMOSUPB           0.120      0.622      0.193       .847 

 PINTIM   WITH 
    PACTORW           -0.050      0.072     -0.696       .487 
    SACTORW           -0.074      0.082     -0.896       .370 
    PPARTW             0.099      0.078      1.277       .202 
    SPARTW            -0.039      0.103     -0.383       .702 

 SINTIM   WITH 
    PACTORW           -0.034      0.065     -0.514       .607 
    SACTORW            0.036      0.077      0.464       .643 
    PPARTW             0.158      0.079      1.999       .046 
    SPARTW            -0.028      0.083     -0.336       .737 

 PACTORW  WITH 
    SACTORW            0.153      0.114      1.340       .180 
    PPARTW             0.001      0.103      0.011       .992 
    SPARTW             0.058      0.074      0.787       .431 

 SACTORW  WITH 
    PPARTW            -0.113      0.090     -1.255       .210 
    SPARTW            -0.123      0.081     -1.515       .130 

 PPARTW   WITH 
    SPARTW             0.125      0.086      1.453       .146 

 PINTIM   WITH 
    SINTIM             0.305      0.086      3.532       .000 

 Means 
    PACTORW            0.454      0.102      4.446       .000 
    PPARTW             0.146      0.091      1.610       .107 
    SACTORW            0.454      0.101      4.472       .000 
    SPARTW             0.346      0.096      3.615       .000 

 Intercepts 
    PINTIM             4.110      0.094     43.677       .000 
    SINTIM             3.816      0.094     40.397       .000 

 Variances 
    PACTORW            0.319      0.121      2.633       .008 
    PPARTW             0.170      0.108      1.570       .117 
    SACTORW            0.253      0.126      2.002       .045 
    SPARTW             0.191      0.100      1.919       .055 

 Residual Variances 
    PINTIM             0.571      0.109      5.247       .000 
    SINTIM             0.575      0.091      6.325       .000 


The first set of results for the between-person part of the model are the between- 
person effects of patient and partner support provision. These effects are based on 
between-person variance in support provision explaining between-person variance in 
daily intimacy. For example, patients with higher average levels of support provision 
also report higher average levels of daily intimacy, $\gamma$ = 1.430, p = .028. What follows 
are the specific covariances among the random intercept, actor, and partner effects. Of 
particular interest might be the covariances between patient and spouse intercepts, actor 
effects, or partner effects. For example, the covariance between patient and spouse ran- 
dom intercepts is significant (.305, p < .001) and indicates that patients who have high 
average levels of intimacy tend to be paired with spouses who have high average levels of 
intimacy. 
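
Covariances such as these are easier to judge once rescaled to correlations. The sketch below is an illustrative calculation, not part of the Mplus output: it divides each covariance by the square root of the product of the two corresponding variances, using the estimates printed in the output excerpts. The function name `cov_to_corr` is an assumption of this sketch. Because the between-level intercept variances are residual variances (the person means of support provision are predictors at that level), the second value is best read as a residual correlation.

```python
import math


def cov_to_corr(cov, var_a, var_b):
    """Rescale a covariance to a correlation."""
    return cov / math.sqrt(var_a * var_b)


# Within level: same-day residual covariance of patient and spouse intimacy.
within_r = cov_to_corr(0.201, 0.933, 0.944)

# Between level: covariance of the patient and spouse random intercepts.
between_r = cov_to_corr(0.305, 0.571, 0.575)

print(round(within_r, 2), round(between_r, 2))
```

Under these estimates the same-day residual correlation is roughly .21 and the correlation between partners' intercepts is roughly .53.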

The fixed effects of the patient and spouse actor effects can be found under the Mplus 
output heading called “Means.” These within-person effects can be interpreted for the 
typical patient and spouse in the sample. Actor effects for both patient and spouse sup- 
port provision are significant, indicating that days on which a patient or spouse indicated 
that he or she provided support are associated with an increase in his or her own daily 
intimacy ($\gamma_{10p}$ = 0.45, p < .001 and $\gamma_{10s}$ = 0.45, p < .001). These effects were not constrained 
to be equal but by chance are almost identical in magnitude. There is also a positive and 
significant spouse-partner effect, indicating that days on which a spouse indicated that 
he provided support to his wife are associated with an increase in her daily intimacy ($\gamma_{20s}$ 
= 0.35, p < .001). The corresponding partner effect for patients was positive but 
fell short of statistical significance ($\gamma_{20p}$ = 0.15, p = .11). Under “Variances” and 
“Residual Variances,” the random effects for both patient and spouse intercepts, actor 
effects, and partner effects appear to be somewhat substantial, as indicated by the mostly 
statistically significant Wald tests of these variances. 
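
The Wald statistics in the output can be reproduced by hand, as in the sketch below. It is an illustration under stated assumptions: the function name `wald_summary` is introduced here, the critical value 1.96 assumes a normal reference distribution, and the inputs are the rounded estimate and standard error for the regression of PINTIM on PEMOSUPB, so results can differ from the printed output in the last digit.

```python
import math


def wald_summary(estimate, se, z_crit=1.96):
    """Return the z ratio, two-tailed p value, and 95% confidence limits."""
    z = estimate / se
    # Two-tailed p value under a standard normal reference distribution.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    lower = estimate - z_crit * se
    upper = estimate + z_crit * se
    return z, p_two_tailed, lower, upper


# Between-level effect of patient mean support provision on patient intimacy.
z, p, lower, upper = wald_summary(1.430, 0.652)
print(round(z, 3), round(p, 3), round(lower, 2), round(upper, 2))
```

Applied to the other rows of the output, the same calculation gives the remaining ratios and intervals, subject to rounding of the printed inputs.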

Figure 22.3 presents the estimated multilevel model as a path diagram. 
The dashed line separates variables that are at the 
within (lower) versus between (upper) levels of analysis. Paths 1 and 2 represent the fixed- 
effect actor effects and Paths 3 and 4 represent the fixed-effect partner effects. The filled 
black circles on the two outcomes represent estimated random effects for the intercepts 
of their respective equations. The filled black circles on the four paths represent estimated 
random effects for these effects. 

Based on these findings, one might be tempted to conclude that while there are signif- 
icant patient and spouse actor effects for support provision, there is a “gender” difference 
in the partner effects because the partner effect was significant for spouse but not statisti- 
cally significant for patient. Before such an interpretation can be valid, an inferential test 
must be performed to test directly the difference between the patient and spouse partner 
effects. To do this in Mplus, we added to the bottom of the syntax the following line of 
code: [ppartw] (1); [spartw] (1). This tells Mplus to constrain the fixed effect for the 
patient partner effect to be equal to the fixed effect for the spouse partner effect. Running 
this revised code allows us to produce a test of the difference in deviances between this 
constrained model (-2*LL = -2*-2256.592 = 4513.18) and the previous unconstrained 
model (-2*LL = -2*-2255.185 = 4510.37) as a $\Delta\chi^2$ test with 1 degree of freedom, 
$\Delta\chi^2$ = 2.81, p = .09, suggesting that 
the constraint does not significantly reduce model fit. This constraint not only tests the 
difference between the patient and spouse partner fixed effects but also provides a more powerful 
test of this single pooled effect (which is now $\gamma_{20}$ = 0.24, p = .001) in comparison to the 
separate effects for patients and spouses. 
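
The arithmetic of this deviance comparison can be checked with a short sketch. It is an illustration only: the function name is an assumption, the two log-likelihoods are the values reported above, and the p value uses the closed form for a chi-square distribution with 1 degree of freedom (one equality constraint), following the simple difference in deviances described in the text.

```python
import math


def deviance_difference_test_1df(ll_constrained, ll_unconstrained):
    """Difference in deviances (-2*LL) and its p value for 1 df."""
    delta = (-2 * ll_constrained) - (-2 * ll_unconstrained)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(delta / 2))
    return delta, p_value


delta, p_value = deviance_difference_test_1df(-2256.592, -2255.185)
print(round(delta, 2), round(p_value, 2))
```

The closed form applies only when the models differ by a single constraint; comparing models that differ by several parameters would need the general chi-square distribution.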


Concluding Comments 


Multilevel modeling of diary and intensive longitudinal dyadic data presents not only 
analytic challenges but also unique opportunities for uncovering potentially important 
interpersonal processes. The findings for the example dataset highlight the use of dyadic 
diary methods and corresponding modeling to uncover actor and partner effects of support 
provision on daily intimacy. Specifically, we found that providing support was associated 
with a daily boost in intimacy for both breast cancer patients and their spouses. 
Moreover, spouses, as well as patients, benefited from their partner’s provision of sup- 
port. This set of effects could only have been revealed by using dyadic intensive longitu- 
dinal designs and a corresponding dyadic analytic framework. 

Nevertheless, there are future directions for further enhancing these types of dyadic 
longitudinal models. For example, the approach discussed earlier needs to be expanded 
to include an assessment of within-partner over-time autocorrelation in outcome scores. 
Moreover, the approach assumes measurement invariance of the outcomes and predic- 
tors both between partners and within partners. Lastly, the variables in the model are 
assumed to be assessed without measurement error. The more general latent variable 
multilevel modeling framework (Muthén & Asparouhov, 2010) described in this chapter 
may be extended further in future work to provide practical solutions to these desirable 
enhancements. 

Finally, we encourage daily diary researchers and those considering the use of daily 
diary methods also to contemplate collecting diary data from significant others in the 
lives of their participants. Humans are inherently social creatures, and dyadic data allow 
important aspects of the interdependent nature of everyday life to be revealed. 


APPENDIX 22.1. Mplus Syntax for Final Model 


TITLE:     Breast cancer couples and daily support example; 
DATA:      FILE IS chapter22example.dat; 
DEFINE:    pemosupb = CLUSTER_MEAN (pemosupw); 
           semosupb = CLUSTER_MEAN (semosupw); 
VARIABLE:  NAMES ARE pintim sintim day0 pemosupw semosupw couple; 
           USEVAR = couple day0 pintim sintim pemosupw semosupw 
                    pemosupb semosupb; 
           BETWEEN = pemosupb semosupb; 
           WITHIN = day0 pemosupw semosupw; 
           CLUSTER = couple; 
           CENTERING is GROUPMEAN (pemosupw semosupw day0); 
           CENTERING is GRANDMEAN (pemosupb semosupb); 
ANALYSIS:  TYPE = twolevel random; 
MODEL:     %WITHIN% 
           pactorw | pintim ON pemosupw; 
           ppartw | sintim ON pemosupw; 
           sactorw | sintim ON semosupw; 
           spartw | pintim ON semosupw; 
           pintim sintim ON day0; 
           pintim WITH sintim; 
           %BETWEEN% 
           pintim ON pemosupb semosupb; 
           sintim ON pemosupb semosupb; 
           pintim WITH sintim pactorw sactorw ppartw spartw; 
           sintim WITH pactorw sactorw ppartw spartw; 
           pactorw WITH sactorw ppartw spartw; 
           sactorw WITH ppartw spartw; 
           ppartw WITH spartw; 
           [ppartw] (1); [spartw] (1); 
OUTPUT:    cinterval; 


CHAPTER 23 


Investigating Temporal Instability 
in Psychological Variables 


Understanding the Real World 
as Time Dependent 


ULRICH W. EBNER-PRIEMER 
TIMOTHY J. TRULL 


Psychology has long neglected the temporal dynamics of psychological variables and 
symptoms. In particular, traditional trait research has tended to focus on behavioral 
aspects that highlight the stability of the constructs of interest. However, emotions and 
behavior do show an impressive variability over time within subjects. Some authors even 
suggest that affective experiences have meaning only because they change over time (Frijda, 
2007; Kuppens, Oravecz, & Tuerlinckx, 2010; Scherer, 2009). Neglecting this instability 
component may result in an inadequate characterization of emotions and behaviors. 
Focusing on instability portrays emotions and behavior as dynamic variables that 
not only adapt continuously to situational effects but also similarly depend on individual 
personality characteristics. 

The historical neglect of the “dynamic” perspective is not surprising given the lim- 
itations of assessing and analyzing time-dependent variables and symptoms based on 
cross-sectional retrospective reports and traditional statistical approaches. Fortunately, 
ambulatory assessment, a research methodology for studying behavior and experience in 
daily life, has dramatically improved our ability to investigate dynamics of psychologi- 
cal processes. Although this methodology goes by different names, such as ambulatory 
assessment (Fahrenberg & Myrtek, 1996, 2001; Fahrenberg, Myrtek, Pawlik, & Perrez, 
2007), ecological momentary assessment (Stone, Shiffman, Schwartz, Broderick, & Huf- 
ford, 2002), experience sampling (Csikszentmihalyi & Larson, 1987), or real-time data 
capture (Stone, Shiffman, Atienza, & Nebeling, 2007), it is characterized by the use of a 
(computer-assisted) methodology to assess self-reported symptoms, behaviors, or physiological 
processes while the participant engages in normal everyday activities. Ambulatory 
assessment offers several advantages over traditional forms of assessment: (1) real-time 
assessment circumvents biased recollection and cognitive reconstruction of the past; 
(2) assessment in real-life situations enhances generalizability; (3) repeated assessment 
of individuals delivers a series of data points that can be used to model instability of 
experience and within-person processes; (4) multimodal assessment includes psychological, 
physiological, and behavioral variables; (5) assessment of the context of the report 
allows the examination of setting- or context-specific relationships; and (6) feedback 
can be given in real time. 

We propose that many psychological variables are time dependent, including affec- 
tive states. They are influenced by environmental characteristics that change over time, 
and they depend on past psychological states (one’s affective state 15 minutes ago) as 
well as influencing future psychological states (one’s next affective state in 15 minutes). 
Therefore, chronology is a fundamental part of these psychological variables. As mentioned 
earlier, some authors have even suggested that psychological variables such as 
affective experiences have meaning only because they are time dependent (i.e., they 
change over time) (Frijda, 2007; Kuppens et al., 2010; Scherer, 2009). Serious confusion 
has often emerged from the interchangeable use of the terms instability and variability. 
There is, though, a crucial difference. Variability is defined as the dispersion of 
scores from a central tendency; as such it does not consider the temporal order of scores 
(Ebner-Priemer, Kuo, et al., 2007; Jahng, Wood, & Trull, 2008; Larsen, 1987; Trull et 
al., 2008). In contrast, instability involves both variability and temporal dependency of 
a process. If temporally successive data points for an individual are highly correlated 
(i.e., high temporal dependency), then this reflects low temporal instability. Therefore, a 
high level of instability requires not only great variability but also a low level of tempo- 
ral dependency. An instructive example may help to clarify this important distinction. 
Imagine that you spent your weekend in the mountains. Unfortunately, the weather was very
unstable (i.e., rain and sunshine alternated every hour). In contrast, your partner spent 
the weekend visiting relatives on the beach. The weather there was not perfect either. On 
Saturday it was sunny all day, but on Sunday it was raining continuously. Even though 
we would call the weather in the mountains unstable and the weather on the beach more 
stable, the variability of the weather in both locations would be identical. Only by
considering the temporal order of weather changes can we distinguish these locations.
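
The weather example can be made concrete with a small sketch. The two hourly series below are a hypothetical coding of the example (1 = sunshine, 0 = rain, 24 values per location), not data from any study, and the mean squared successive difference is used here only as one simple order-sensitive summary.

```python
from statistics import pstdev

def mssd(series):
    """Mean squared successive difference: average of (x[i+1] - x[i])**2."""
    diffs = [(b - a) ** 2 for a, b in zip(series, series[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical coding: 1 = sunshine, 0 = rain, one value per hour.
mountains = [1, 0] * 12        # sunshine and rain alternate every hour
beach = [1] * 12 + [0] * 12    # sunny first half, rainy second half

sd_mountains, sd_beach = pstdev(mountains), pstdev(beach)
mssd_mountains, mssd_beach = mssd(mountains), mssd(beach)

print(f"SD   mountains={sd_mountains:.2f}  beach={sd_beach:.2f}")
print(f"MSSD mountains={mssd_mountains:.2f}  beach={mssd_beach:.2f}")
```

Both locations have the same dispersion, because they contain the same values in a different order. Only the order-sensitive summary separates the hourly alternation from the single change.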

We discuss this issue in greater detail below, but we want to highlight early on that 
our chapter focuses on time-dependent processes, not just on variability. 

Our aim in this chapter is to provide an overview of instability as an important con- 
cept in the study of psychological variables in daily life. To this end, we (1) briefly men- 
tion several reviews on the study of temporal instability in psychology; (2) report on some 
typical misconceptions in the investigation of instability, especially with regard to (a) the 
assessment instrument, (b) the sampling protocol, (c) the time-based design, (d) analytic 
indices, and (e) the graphical data description; and (3) conclude by presenting guidelines 
and recommendations for the investigation of instability. 


Temporal Instability in Psychology 


Because this chapter is concerned primarily with methodological issues, we do not 
present a comprehensive review of all papers investigating instability and variability in 
psychology. However, two special issues of psychological journals have presented good
overviews of the literature. 

In individual-difference research, a special issue, “Advances in Personality and Daily 
Experience,” in the Journal of Personality was edited by Tennen, Affleck, and Armeli 
(2005). The main topics were personality and reactions to everyday events, interpersonal 
manifestations of personality in daily experience, and applications of daily personality 
processes to health. Even though instability was not the main focus of the special issue, 
most papers dealt with variation in daily life and daily processes, as did the precursor 
special issue, “Personality and Daily Experience,” published in 1991 in the same journal
(Wheeler & Reis, 1991). Concerning research on emotions, a 2009 special issue of Cogni- 
tion and Emotion, edited by Kuppens, Stouten, and Mesquita, is of particular interest. In 
this issue, theoretical contributions describing emotions as flexible, multicomponential, 
and dynamical processes are complemented by empirical work on appraisal components 
and the experience of specific emotions, as well as by contributions that directly address 
individual differences in the temporal dynamics of emotions. 

Most relevant to the field of clinical psychology are two additional points. First, there 
are psychological disorders that are explicitly defined by instability over time (American 
Psychiatric Association, 2000). Bipolar I or bipolar II disorder and borderline personal- 
ity disorder (BPD) are the most obvious examples. Second, it is noteworthy that marked 
instability has been revealed in a range of psychiatric disorders for which “instability” is 
not included as a diagnostic criterion, such as anxiety disorders (Pfaltz, Michael, Gross- 
man, Margraf, & Wilhelm, 2010), bulimia (Anestis et al., 2010), and depression (Peeters, 
Berkhof, Delespaul, Rottenberg, & Nicolson, 2006). These findings highlight how meth- 
ods that can uncover instability may reveal unstable processes across a range of condi- 
tions, moods, and behaviors, sometimes where we do not expect it. 


Pitfalls in Assessing Instability 


Unstable and dynamic processes can be investigated only if time is both considered and 
integrated into the research question itself, the assessment or sampling method, and the 
data-analytic strategy. A failure to consider time as a factor in the research question, data 
collection, or data analysis will limit the validity of the study. For example, the assess- 
ment of psychological variables with ambulatory sampling techniques produces data 
along an individual time series, but in the absence of appropriate analytic techniques, the 
researcher’s ability to characterize the dynamic process of interest adequately is severely 
hampered. 


Choosing an Assessment Instrument 


The assessment of instability, by definition, requires that the researcher track the variables 
of interest over time. We have already argued that this can easily be done with ambulatory 
assessment technologies. Nevertheless, some researchers attempt to take a “shortcut” in 
assessing instability by administering retrospective questionnaires and interviews. This 
approach suffers from a major drawback, because retrospective measures of instability 
are even less reliable than retrospective estimates of average (mean) experiences. 
For example, the International Personality Disorder Examination (IPDE; Loranger,
1999) provides a retrospective assessment of instability by interview. Questions such as 
“Do your feelings often change very suddenly and unexpectedly, sometimes for no obvi- 
ous reason?” are used to estimate affective instability in BPD. The Affective Lability 
Scales (ALS; Harvey, Greenberg, & Serper, 1989) is an example of retrospective assess- 
ment of affective instability by questionnaire. Statements such as “One minute I can be 
feeling OK and then I feel tense, jittery, and nervous” are used to rate the degree of affec- 
tive instability experienced. 

It has been documented (see Schwarz, Chapter 2, this volume) that retrospection is 
subject to multiple systematic distortions (Fahrenberg et al., 2007; Stone et al., 2007). 
Recall is often based on biased storage and recollection of memories (Fredrickson, 
2000). Multiple memory heuristics have been identified, such as duration neglect, the
mood-congruent memory effect, and the affective valence effect, all of which not only 
increase inaccuracy but also introduce systematic errors (for a detailed discussion, see 
Ebner-Priemer & Trull, 2009). As a consequence, the U.S. Food and Drug Administra- 
tion (FDA; 2009) recently issued guidance for the pharmaceutical industry, noting that 
real-time data are desirable for evaluating patient-reported outcomes. Unfortunately, the 
recall of changes or fluctuations appears to be even more unreliable than the recall of 
more general, collectively “averaged” and recalled experience of past symptoms. 

For example, Solhan, Trull, Jahng, and Wood (2009) have examined the discrepancy 
between questionnaire, retrospective report, and e-diary measures of affective instabil- 
ity. Patients with BPD and major depressive disorder took part in an e-diary study over a 
period of 4 weeks and completed three trait measures of affective instability: the Affective 
Instability subscale of the Personality Assessment Inventory Borderline Features scale 
(Morey, 1996), the Affect Intensity Measure (Larsen, Diener, & Emmons, 1986), and 
the ALS (Harvey et al., 1989). Participants also provided a retrospective report of affec- 
tive experience, including large shifts in mood state, for the preceding 4-week period. 
Results indicated that questionnaire measures of affective instability showed inconsistent 
and relatively modest associations with participants’ experienced affective instability, as 
derived from the e-diary reports of mood states in patients’ natural environment (e.g., the 
range of correlations between the ALS and instability measures derived from the e-diary 
was .03 to .34). Retrospective reports of extreme mood changes, whether over the previ- 
ous month or even the preceding week, were largely unrelated to e-diary results. 

Stone, Broderick, Shiffman, and Schwartz (2004) investigated the association 
between recalled pain and pain assessed by multiple momentary reports using e-diaries. 
They assessed recalled pain levels for 2 weeks, recalled judged changes of pain from 1 
week to another, and multiple momentary reports. Multiple momentary reports were used 
to calculate both pain levels for each week and changes in pain from 1 week to another. 
As the associations between recalled and momentary pain were weak, they proposed that
recall measures that assess change scores retrospectively are inherently unreliable, in con- 
trast to measures based on momentary assessment, which are less susceptible to bias. 

This is in line with the results of Ebner-Priemer, Bohus, and Kuo (2007), who showed 
that the coherence between expert ratings (of BPD criteria based on a diagnostic inter- 
view) and ambulatory assessment was lower for a criterion defined by instability (affec- 
tive instability) compared to a criterion for which instability is not part of the definition 
(inappropriate intense anger). For the assessment of instability, we therefore recommend 
methods that make use of real-time assessment and multiple time points, such as those
provided by ambulatory assessment. 


Choosing a Sampling Protocol 


If ambulatory assessment is used for data collection, a sampling protocol must be defined. 
There are five main sampling protocols: (1) time-contingent sampling; (2) event-contingent 
sampling; (3) continuous sampling; (4) interactive sampling; and (5) combined sampling. 
These protocols are explained in more detail by Conner and Lehman (Chapter 5, this 
volume), Fahrenberg and colleagues (2007), and Shiffman (2007). 

Event-contingent recordings are especially useful for investigating discrete events 
such as substance use or social interactions, because participants provide data only when 
an event actually occurs. Research examples of event-contingent recordings include the 
investigation of dynamic influences on the smoking relapse process (Shiffman, 2005). But
event-contingent recordings in assessing instability processes can suffer from the disad- 
vantage that data are gathered only when an event occurs. For example, analyzing the 
instability of mood on the basis of event-contingent recording of social interaction is 
limited to mood during social interactions and does not capture mood at other times, 
when the participant is not interacting. This is especially problematic because mood and 
mood instability may be associated with the social interaction itself (positive mood dur- 
ing social interaction, neutral mood during interaction-free periods). 

Time-contingent recordings (e.g., hourly assessments, random assessments within 
the day) are best for investigating the dynamics of continuous phenomena. Research 
examples of time-contingent recordings include the investigation of the ebb and flow of 
manic symptoms in bipolar disorders (Bauer et al., 2006), or the investigation of affec- 
tive instability in BPD (Ebner-Priemer, Kuo, et al., 2007; Trull et al., 2008) or anxiety 
disorders (Pfaltz et al., 2010). 

Combined sampling approaches integrate time-contingent with event-contingent 
recordings and are useful in the study of the interplay between events and continuous 
phenomena. The investigation of mood- and restraint-based antecedents to binge epi- 
sodes in bulimia nervosa (Steiger et al., 2005) is a good example of combined sampling 
plans. More recently, Preston and colleagues (2009) conducted an ambulatory assessment 
of 112 polydrug users (heroin and cocaine) and found that craving (assessed at random 
intervals) increased linearly in the 5 hours preceding cocaine use but not heroin use. 


Choosing the Time-Based Design 


The time-based design is a crucial part of the research design in ambulatory assessment. 
It is defined by the number of assessment points, the intervals between assessment points, 
and the total length of the assessment period. There is general agreement that the time- 
based design should fit the temporal dynamics of the processes of interest (Bolger, Davis, 
& Rafaeli, 2003; Collins, 2006; Collins & Graham, 2002; Ebner-Priemer, Welch, et al., 
2007; Fahrenberg, Leonhart, & Foerster, 2002; Pawlik & Buse, 1982; Stone & Shiffman, 
2002; Warner, 1998). 

A broad range of difficulties emerges when the time-based design does not fit the 
dynamics of the processes of interest. Intervals that are either too long or too short
can be misleading (Bolger et al., 2003; Collins & Graham, 2002; Scollon, Kim-Prieto,
& Diener, 2003; Warner, 1998). Excessively long intervals may fail to detect natural, 
higher-frequency cycles and exclude important processes. For example, once-a-month 
assessments of affective states do not track a daily or weekly affective process but may 
be well suited for random assessments of affective states over the lifespan. On the other 
hand, too-short intervals may also fail to capture important systematic within-subject 
variability. For example, assessing affect every millisecond for a total of 1,000 assessment 
points might not capture within-subject variability well, because the affective change of 
interest might not occur within 1 second. Too-short intervals also increase the burden 
placed on participants. 

Unfortunately, no general conventions for time-based designs in e-diaries have been 
established. The lack of published recommendations is not surprising given that the tem- 
poral dynamics of emotional or cognitive processes are largely unknown (for a notable 
exception, see Hasler, Mehl, Bootzin, & Vazire [2008], who investigated 24 sinusoidal 
rhythms of affective experiences). Moreover, only a few studies have systematically
investigated the use of different time-based designs (e.g., Stone et al., 2003),
and researchers typically do not explain or justify their choice of
time-based design. When studying instability, the empirical determination of time-based 
designs requires clarification, especially with regard to the length of time intervals. In 
other fields of ambulatory assessment, such as psychophysiology, the temporal dynamics 
of processes are well known, and these are considered when sampling rates are defined. 
Faster processes, such as those measured with electrocardiography, may be registered at 
rates of up to 2,000 Hz, whereas slower processes, such as skin temperature, are recorded 
at lower frequencies (4 Hz). 

In a recent paper, Ebner-Priemer and Sawitzki (2007) proposed several approaches 
to investigating the appropriateness of a chosen time-based design for the investigation 
of instability. By plotting the time series, as shown in Figure 23.1, the chosen time-based 
design and the within-person instability can be visually inspected (for a detailed explana- 
tion of Figure 23.1, please see the section on graphical description of instability, below). 
Unfortunately, identifying instability in the data does not in itself demonstrate that the 
process of interest is captured, and that there are not other, more important processes 
that might be faster or slower. 

Similarly, the absence of within-subject variability can result from no variability
being experienced, from too high a sampling frequency (as in the case of assessing affect each
millisecond), or from a periodic (sinusoidal) process. If the frequency of the periodic
process and the sampling frequency are exactly the same, the sample may contain only
identical values, therefore showing no variability at all. (Imagine a sinusoidal process
with a frequency of approximately 0.017 Hz, which is 1 cycle per minute. When monitoring
this cycle with a sampling frequency of one assessment point per minute [also approximately
0.017 Hz], one would obtain scores from the process always at the same point in the cycle,
e.g., at the peak. This would result in identical values for each assessment point and,
in the end, in a flat line instead of a sinusoidal process.)
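
The flat-line effect can be reproduced in a few lines. The sketch below samples an idealized sine wave with one cycle per minute. The 7-second comparison interval and the starting phase at the peak are arbitrary assumptions chosen for illustration.

```python
import math

def sample_sine(period_s, interval_s, n_samples, phase=math.pi / 2):
    """Sample sin(2*pi*t/period + phase) every `interval_s` seconds."""
    return [math.sin(2 * math.pi * (k * interval_s) / period_s + phase)
            for k in range(n_samples)]

# One cycle per minute, sampled once per minute: every sample hits the peak.
matched = sample_sine(period_s=60, interval_s=60, n_samples=10)
# Same process sampled every 7 seconds: the oscillation becomes visible.
faster = sample_sine(period_s=60, interval_s=7, n_samples=10)

spread_matched = max(matched) - min(matched)
spread_faster = max(faster) - min(faster)
print(f"range when sampling once per cycle: {spread_matched:.6f}")
print(f"range when sampling every 7 s:      {spread_faster:.6f}")
```

When the sampling interval equals the period, the range of the sampled values is zero up to rounding error, although the underlying process swings between -1 and 1.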

Figure 23.2 presents another way of examining instability by plotting the frequency 
of score transition values from one time to the next. Figure 23.2A shows transitions from 
a rating at time t to a rating at time t + 1. For example, the bottom right corner represents
transitions from participants who rated their distress as 10 in a current self-report (t) and 
distress as 0 in the next self-report (t + 1). Therefore, the frequency of all changes from 
FIGURE 23.1. Graphical description of a three-dimensional dataset covering time, subject, and
the respective values of the variables of interest. Each row represents a subject and each square a 
self-report; the different degrees of shading denote the respective value (i.e., intensity of distress). 
The first 50 rows are data from patients with borderline personality disorder; the next 50 are data 
from healthy controls. Reproduced by permission from European Journal of Psychological Assess- 
ment 2007; Vol. 23(4): 238-247. © 2007 Hogrefe & Huber Publishers (www.hogrefe.com). 




[Figure 23.2: three panels of transition frequencies (A: lag 1, BPD; B: lag 20, BPD; C: BPD simulation). x-axis: distress at time t (0 to 10); y-axis: distress at time t + 1, or simulated distress (0 to 10).]


FIGURE 23.2. Graphical description covering the frequency of transitions using different time
lags (A: lag = 1; B: lag = 20; C: simulated, randomly distributed data). Transitions from a distress
rating at time t (x-axis) to a distress rating at time t + x (y-axis) are plotted in each square.
Reproduced by permission from European Journal of Psychological Assessment 2007; Vol. 23(4):
238-247. © 2007 Hogrefe & Huber Publishers (www.hogrefe.com).


10 to 0 is discernible in the bottom right corner. The frequency of the opposite pattern, 
transitions from 0 to 10, can be seen in the top left corner. In this example, the frequency 
of transitions is coded in color (black, grey, white), with the most frequent changes being 
represented in darker shades of color. To decide whether a time-based design fits the 
temporal dynamics of the processes of interest, Ebner-Priemer and Sawitzki (2007) used 
two different approaches. Both approaches share the assumption that tracking a process 
should result in data points that are related to each other over time. In other words, 
independent data over time might be caused by randomly picking states out of a given 
process, without monitoring the process itself. 

In the first approach, Ebner-Priemer and Sawitzki (2007) randomly distributed their 
data (simulation) to eliminate the within-subject time-based structure of the data. Sub- 
sequently, frequency of transitions for the simulated data (as in Figure 23.2C) was com- 
pared with the original data (as in Figure 23.2A) and with time-lagged data (as in Figure 
23.2B). This was done graphically and statistically. Visual inspection may be used as a 
first step to see differences between the original data and the randomly distributed data 
(with eliminated within-subject time-based structure). In our example (Figure 23.2A vs. 
23.2C), larger transitions (+2 and +3) are more frequent in the randomly distributed 
data. Comparing the percentage of transitions between different time lags may help to 
decide which time lags (+15 minutes, +30 minutes, +45 minutes) are still different from 
random data. Ebner-Priemer and Sawitzki found that for distress ratings, especially short 
intervals (15 minutes, 30 minutes) tapped a specific process. 
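
The logic of this first approach can be sketched as follows. The ratings are a hypothetical, slowly changing series rather than the published data, and the function names and the threshold of 2 scale points for a "large" transition are assumptions made for illustration.

```python
import random
from collections import Counter

def transition_counts(series, lag=1):
    """Count how often a rating at time t is followed by each rating at t + lag."""
    return Counter(zip(series, series[lag:]))

def share_large_jumps(series, lag=1, threshold=2):
    """Proportion of transitions whose absolute change is at least `threshold`."""
    pairs = list(zip(series, series[lag:]))
    return sum(abs(b - a) >= threshold for a, b in pairs) / len(pairs)

# Hypothetical distress ratings (0-10) that move by one scale point per assessment.
ratings = [abs((t % 20) - 10) for t in range(200)]

# Destroy the temporal structure by shuffling the same values.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
shuffled = ratings[:]
rng.shuffle(shuffled)

counts = transition_counts(ratings)
observed = share_large_jumps(ratings)
randomized = share_large_jumps(shuffled)
print(f"transitions from 10 to 9 in original order: {counts[(10, 9)]}")
print(f"large jumps, original order: {observed:.2f}")
print(f"large jumps, shuffled order: {randomized:.2f}")
```

If the ordered data show markedly fewer large transitions than the shuffled data, the sampling interval is capturing a time-structured process. If the two are indistinguishable, the assessments behave like random draws from the distribution of states.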

The second approach uses autocorrelation coefficients for increasing time lags as a 
numerical analogue to the graphical evaluation of the frequency of transitions. Again 
the basic assumption was to search for patterns that showed a time-based structure. 
Ebner-Priemer and Sawitzki (2007) revealed significant autocorrelations (i.e., a strong 
time dependency) for short time intervals (15 and 30 minutes) but zero correlations for 
time series with 2- or 4-hour intervals. The zero correlation indicates that successive values
do not depend on one another, suggesting that these do not differ significantly from
a random process.
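
A minimal sketch of the second approach, assuming a hypothetical triangular-wave series with a 40-point cycle in place of real ratings. The helper `lagged_correlation` is an illustrative name and computes a plain Pearson correlation between the series and its lagged copy.

```python
from math import sqrt

def lagged_correlation(series, lag):
    """Pearson correlation between x[t] and x[t + lag]."""
    x, y = series[:-lag], series[lag:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings: a slow triangular wave with a 40-point cycle (not real data).
ratings = [abs((t % 40) - 20) / 2 for t in range(400)]

by_lag = {lag: lagged_correlation(ratings, lag) for lag in (1, 2, 10, 20)}
for lag, r in by_lag.items():
    print(f"lag {lag:>2}: r = {r:+.2f}")
```

In this artificial series, adjacent points are strongly related, points a quarter cycle apart are essentially uncorrelated, and points half a cycle apart are strongly negatively related. Which lags still carry information therefore depends on the tempo of the process.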


Choosing Analytic Indices 


A common pitfall in analyzing instability is the use of analytic strategies that do not (fully) 
capture instability. It is somewhat surprising that this still occurs, but just as researchers 
have neglected to understand psychological variables as time dependent, they have also 
neglected to consider whether their statistical strategies adequately capture instability. 

Instability has been described as addressing three general components: amplitude, 
frequency, and temporal dependency (Ebner-Priemer, Kuo, et al., 2007; Larsen, 1987). 
Instead of a three-component model, Trull and colleagues (2008) used a two-component 
model with variability and temporal dependency (basically a similar approach, as vari- 
ability covers frequency and amplitude). Analytic indices of instability should, of course, 
cover all three components. If the amplitude of a time series is increased, the mathemati- 
cal instability indices should reveal higher scores. Furthermore, the mathematical insta- 
bility indices should also reflect temporal (in)dependency. 

An example illustrates these points. An often used index for instability is the stan- 
dard deviation (Cowdry, Gardner, O’Leary, Leibenluft, & Rubinow, 1991; Links et al., 
2007; Russell, Moskowitz, Zuroff, Sookman, & Paris, 2007). This index has the advan- 
tages that it is familiar to most researchers and is used frequently. But does this index 
address the three components of instability? For illustration purposes, we plotted a simple
time series in Figure 23.3A. In Figure 23.3B, we plotted the same time series but with
amplitude decreased by a factor of 2. In Figure 23.3C, we plotted the same time series
but with frequency decreased by a factor of 2. And in Figure 23.3D, we plotted the same
time series but redistributed the data points, which means that we changed the temporal
order (we actually distributed the data points systematically in order to reveal a lower
instability). On visual inspection of the figure, the higher instability in Figure 23.3A
than in 23.3B, 23.3C, and 23.3D is apparent. Therefore, one would expect that true and
valid instability indices should be highest for 23.3A relative to Figures 23.3B, 23.3C, and
23.3D. However, a comparison of standard deviation (SD) scores for the four time series
does not reveal differences between 23.3A (original time series) and 23.3D (time series 
with altered temporal order): SD(Figure 23.3A) = 4.1; SD(Figure 23.3B) = 2.1; SD(Figure 
23.3C) = 3.0; SD(Figure 23.3D) = 4.1. The reason for this perhaps unexpected result 
is straightforward. The standard deviation is insensitive to temporal ordering of scores 
and therefore cannot differentiate between the stable and unstable patterns, as shown in 
Figure 23.3. 
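
The order-insensitivity of the standard deviation is easy to check. The four 15-point series below are hypothetical and were chosen only to resemble the panels of Figure 23.3. They are an assumption, not the original plotted data.

```python
from statistics import stdev

# Hypothetical 15-point series resembling the four panels (not the original data).
original = [1, 9] * 7 + [1]                   # A: alternates between 1 and 9
reduced_amplitude = [3, 7] * 7 + [3]          # B: same rhythm, half the amplitude
reduced_frequency = ([1, 5, 9, 5] * 4)[:15]   # C: same amplitude, half the frequency
reordered = sorted(original)                  # D: same values as A, new temporal order

sds = {
    "A": stdev(original),
    "B": stdev(reduced_amplitude),
    "C": stdev(reduced_frequency),
    "D": stdev(reordered),
}
for panel, value in sds.items():
    print(f"SD({panel}) = {value:.1f}")
```

Series A and D contain exactly the same values, so their standard deviations are equal even though A changes at every step and D changes only once.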

Another frequently used index for the analysis of instability (Cowdry et al., 1991; 
Stein, 1996) is the autocorrelation (AC). A process with high temporal dependency should 
exhibit high autocorrelation scores and therefore a low level of instability. On the other 
hand, a truly random process would show a zero AC and therefore have a high level of 
instability. Calculating an AC coefficient (with a lag of 1) for the time series in Figures 
23.3A-D reveals the following scores: AC(Figure 23.3A) = -1; AC(Figure 23.3B) = -1; 
AC(Figure 23.3C) = 0.01; AC(Figure 23.3D) = 0.86. Clearly, the AC index does not reflect
the amplitude differences in a time series (Figure 23.3A compared to Figure 23.3B). A 
reduction in amplitude (by a factor of 2) did not result in a different value for the auto- 
correlation, suggesting that the autocorrelation cannot differentiate between time series 




[Figure 23.3: four line plots (A: Original Time Series; B: Reduced Amplitude; C: Reduced Frequency; D: Altered Temporal Order). x-axis ticks run from 1 to 15; y-axis runs from 0 to 10.]


FIGURE 23.3. Original time series (A) and time series with reduced amplitude (B), reduced fre- 
quency (C), and altered temporal dependency (D). 


with large or small changes. This is because the correlation is computed by dividing the 
covariance of the two variables by the product of their standard deviations. Amplitude 
changes are canceled out by this procedure. 
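
The cancellation can be verified directly. The sketch below computes a lag-1 correlation as the covariance of successive values divided by the product of their standard deviations. The series are hypothetical stand-ins for the large-amplitude, small-amplitude, and reordered patterns, and the function name is illustrative.

```python
from math import sqrt

def lag1_correlation(series):
    """Pearson correlation between each value and its immediate successor."""
    x, y = series[:-1], series[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical series (assumptions for illustration, not the original data).
large_swings = [1, 9] * 7 + [1]     # alternates between 1 and 9
small_swings = [3, 7] * 7 + [3]     # same rhythm, half the amplitude
reordered = sorted(large_swings)    # same values as large_swings, one single change

ac_large = lag1_correlation(large_swings)
ac_small = lag1_correlation(small_swings)
ac_reordered = lag1_correlation(reordered)
print(f"large swings: {ac_large:+.2f}")
print(f"small swings: {ac_small:+.2f}")
print(f"reordered:    {ac_reordered:+.2f}")
```

Halving the amplitude leaves the coefficient unchanged, whereas changing only the temporal order moves it from strongly negative to strongly positive.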

We have used the preceding examples to illustrate the problem of some common 
indices used to analyze instability. Fortunately, many researchers are now aware that 
standard deviations are insensitive to differences in the temporal order of values, and 
that ACs are insensitive to differences in amplitude. But there is less awareness of the 
mathematical procedures for more computationally complex indices of instability. A mul- 
titude of mathematical approaches capture dynamic processes, such as hierarchical dif- 
fusion models (Kuppens et al., 2010), damped oscillator models (Boker & Nesselroade, 
2002; Chow, Ram, Boker, Fujita, & Clore, 2005), dynamic factor analysis (Wood & 
Brown, 1994), fractal dimension (Ebner-Priemer, Kuo, et al., 2007), power spectral den- 
sity (Ebner-Priemer, Kuo, et al., 2007), mixture distribution models (Eid & Langeheine, 
2003), mixture distribution latent state-trait analysis (Courvoisier, Eid, & Nussbeck, 
2007), and sequence analysis (Reisch, Ebner-Priemer, Tschacher, Bohus, & Linehan, 
2008), to name a few. Due to space limitations, we are not able to explain each math- 
ematical approach in detail or to discuss its advantages and disadvantages regarding 
the three components. Instead, we want to highlight the possibility to create time series 
that differ in frequency, amplitude, and temporal dependency (as shown in Figure 23.3). 
Using such time series enables researchers to decide which components of instability are 
represented in each mathematical index (see Ebner-Priemer, Kuo, et al., 2007, for an
overview). 

One index we find promising is the mean squared successive differences (MSSD) 
index (Ebner-Priemer, Kuo, et al., 2007), which, as its name implies, is the average of 
the squared difference between successive observations and addresses all three compo- 
nents of instability: amplitude, frequency, and temporal dependency. Compared to the 
MASD (mean absolute successive difference), the MSSD squares successive differences 
and therefore weights large changes higher compared to small changes (similar to a stan- 
dard deviation). The MSSD is a single-level index, as all successive differences per sub- 
ject are averaged. Jahng and colleagues (2008), as well as Trull and colleagues (2008), 
introduced a multilevel approach using squared successive differences (SSDs). As squares 
of successive differences follow a chi-square distribution, they use a generalized mul- 
tilevel model with a gamma error distribution and log link. Using multilevel modeling 
dramatically increases the possibilities for data analysis because time-varying covariates 
can be used (e.g., the influence of social interactions on instability of mood). In addition, 
Jahng and colleagues and Trull and colleagues adjusted their index for the time frame 
between two assessment points (ASSDs: adjusted squared successive differences). When 
using randomly spaced assessment points, which is done frequently in e-diary research, 
time intervals between assessment points differ. Unfortunately, a mood change might be 
interpreted differently when it occurred in 5 minutes compared to 5 hours. This bias is a 
result of a positive correlation between assessment interval and successive changes, and 
can be resolved by detrending, resulting in ASSDs. 
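
The single-level indices described above can be sketched in a few lines. The series are hypothetical, and the sketch covers only the plain MSSD and MASD. It does not attempt the multilevel model or the interval adjustment that yields ASSDs.

```python
def successive_differences(series):
    """Differences between each observation and the one before it."""
    return [b - a for a, b in zip(series, series[1:])]

def mssd(series):
    """Mean squared successive difference."""
    d = successive_differences(series)
    return sum(x * x for x in d) / len(d)

def masd(series):
    """Mean absolute successive difference."""
    d = successive_differences(series)
    return sum(abs(x) for x in d) / len(d)

# Hypothetical series varying amplitude, frequency, and temporal order.
original = [1, 9] * 7 + [1]
reduced_amplitude = [3, 7] * 7 + [3]
reduced_frequency = ([1, 5, 9, 5] * 4)[:15]
reordered = sorted(original)

for name, s in [("original", original), ("reduced amplitude", reduced_amplitude),
                ("reduced frequency", reduced_frequency), ("reordered", reordered)]:
    print(f"{name:>18}: MSSD = {mssd(s):.2f}")

# Same total change, spread evenly versus concentrated in one jump.
steady = [0, 2, 4, 6, 8]
jump = [0, 0, 0, 0, 8]
print(f"MASD steady={masd(steady)} jump={masd(jump)}")
print(f"MSSD steady={mssd(steady)} jump={mssd(jump)}")
```

In these examples the MSSD is highest for the original series and drops when amplitude, frequency, or temporal order is altered. The last two lines show how squaring weights one large change more heavily than several small changes of the same total size.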

Because mood can be conceptualized as a two-dimensional process (e.g., composed 
of valence and arousal, as in Figure 23.4), it may be useful to calculate successive differ- 
ences in a two-dimensional space. Ebner-Priemer, Eid, Stabenow, Kleindienst, and Trull 
(2009) proposed two alternatives to measure the distance between successive points in 
a two-dimensional space. The Euclidean distance uses straight-line (“airline”) distances between
successive points in the two-dimensional space, whereas the City Block distance corresponds
to the distance a car would drive in a city laid out in square blocks (i.e., using
rectangular connections between succeeding data points in the two-dimensional space).
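
As a brief illustration of the two distance measures, the following sketch computes both for a hypothetical trajectory of (valence, arousal) ratings. The coordinates and function names are assumptions chosen for illustration.

```python
import math

def euclidean_steps(points):
    """Straight-line distance between each pair of successive points."""
    return [math.hypot(x2 - x1, y2 - y1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

def city_block_steps(points):
    """Sum of absolute coordinate changes between successive points."""
    return [abs(x2 - x1) + abs(y2 - y1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]

# Hypothetical (valence, arousal) ratings at four successive assessments.
trajectory = [(0, 0), (3, 4), (3, 1), (-1, 1)]

euclid = euclidean_steps(trajectory)
blocks = city_block_steps(trajectory)
print("Euclidean: ", euclid)
print("City Block:", blocks)
```

The two measures agree whenever only one dimension changes and diverge when valence and arousal change together, in which case the City Block distance is the larger of the two.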

FIGURE 23.4. Hypothetical examples of affective trajectories in a two-dimensional space for
three single subjects.

It is important to note three possible cases for which time-sensitive indices may not
be the best analytic choice. First, we caution against the use of time-sensitive indices 
when the protocol involves an infrequent (low-frequency) sampling. For example, when 
investigating the instability of parent-child interaction, a low sampling frequency (e.g., 
one interaction per week) would not be sufficient to characterize instability in social 
behavior. To treat such infrequent assessments of parent-child interaction as instances 
of successive social behavior would be arbitrary and potentially misleading. In cases in 
which the sampling frequency is too low to track the dynamics of the symptom of inter- 
est, we recommend using analytic methods that are not time-sensitive. 

Second, time-sensitive indices are clearly not advisable when the research question 
does not focus on dynamical processes. This might be the case when the research ques- 
tion refers strictly to the variability of behavior and not to instability. For example, a 
researcher might be interested in clinging behavior of children. Ambulatory assessment 
might be used to assess whether clinging behavior is present in almost all social interac- 
tions, or whether there is some evidence of variability in this behavior. In this example, 
the variability of behaviors is of primary interest, independent of the temporal order or 
sequence of the behavior. 

Third, when missing data points are prevalent, time-sensitive indices may not be 
the best analytic choice. Missing data are usually infrequent in ambulatory assessment 
studies when electronic devices such as e-diaries are used. For example, Ebner-Priemer, 
Welch, and colleagues (2007) reported a rate of 4% over 24 hours, and Trull and col- 
leagues (2008) reported a missing data rate of 13.5% over 28 days. However, there have 
been studies with as much as 42% missing data (Links et al., 2007). Minimal missing 
data are critical for an instability index that compares successive values, for example, 
the MSSD index. A missing data rate of 42% will result in up to 80% missing successive 
differences. This is because one missing data point can lead to two missing transitions 
(i.e., before and after). Accordingly, when many data points are missing, we strongly
recommend assessing variability with robust methods that do not assume the presence
of temporally sequential data.
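
The effect of missing prompts on successive differences can be illustrated with a small count. The two 10-point series below are hypothetical. Each has four missed prompts (marked `None`), either scattered or clustered.

```python
def usable_transitions(series):
    """Count successive pairs in which both observations are present."""
    return sum(a is not None and b is not None
               for a, b in zip(series, series[1:]))

# Hypothetical series with 40% missing prompts in each case.
scattered = [4, None, 5, None, 6, None, 5, None, 4, 5]   # four isolated gaps
clustered = [4, 5, 6, 5, 4, 5, None, None, None, None]   # one block of four gaps

total = len(scattered) - 1  # number of successive pairs in a complete series
usable_scattered = usable_transitions(scattered)
usable_clustered = usable_transitions(clustered)
print(f"scattered gaps: {usable_scattered} of {total} transitions usable")
print(f"clustered gaps: {usable_clustered} of {total} transitions usable")
```

With the same proportion of missing data, isolated gaps remove far more successive differences than a single block of gaps, because each isolated gap eliminates both the transition into it and the transition out of it.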


Graphical Description of Instability: 
Choosing Three-Dimensional Graphics 


The use of traditional figures (e.g., bar charts, line plots) to plot research data has the 
disadvantage that data are usually aggregated. Although such figures do present two 
dimensions of data, such as group and value, they cannot portray three-dimensional 
relationships that involve time, participant, and the respective values of the variables of 
interest. Ebner-Priemer and Sawitzki (2007) proposed a graphical description for such 
three-dimensional datasets. 

Figure 23.1 plots the results of an e-diary study. Fifty patients with BPD and 50 
healthy controls (HCs) answered e-diary queries every 15 minutes over the course of a 
day, resulting in approximately 50 data points per subject. Each row represents a subject 
(BPD = rows 1-50; HC = rows 51-100), each square is an e-diary self-report, and the 
color denotes the level of distress (bright = low distress, dark = high distress). The upper 
half of the figure is dominated by bright colors, representing low levels of distress in the 
HCs. The lower half of the figure is darker, demonstrating the medium and high
distress ratings of patients with BPD. The fast-changing colors in the lower half of the
figure represent high instability in the patients with BPD (we used R for plotting data and
transitions [R Foundation for Statistical Computing, 2006]; R syntax and software tools 
can be downloaded free of charge at www.ambulatory-assessment.org). 

Inspiring examples for graphically visualizing time courses in a two-dimensional 
affective space have been reported by Kuppens and colleagues (2010). Affective trajecto- 
ries were plotted in a two-dimensional space, showing directly how affect changed over 
time in a single subject (see Figure 23.4 for hypothetical data): (A) a subject with generally
good mood, who had one bad experience during the sampling period; (B) a subject with
huge instability in valence over time; and (C) a subject who had good mood at the begin- 
ning but bad mood at the end of the sampling period. Furthermore, pooled vectors plot- 
ted on a two-dimensional affective grid showed the velocity of affective changes. 


Conclusions and Recommendations on How to Integrate 
Temporal Dynamics and Instability into Your Research 


In this last section, we summarize our basic recommendations for researchers seeking to 
study temporally unstable and dynamic processes in everyday life. 


1. Is time or temporal order relevant to your research question? Depending on the 
research question, the temporal order may or may not be of interest. For example, if the 
number of cigarettes smoked is the focus of one’s research question, the modeling of time 
or temporal order might not be interesting or necessary. However, if the research question 
focuses on relapses in smoking and how craving for cigarettes evolves and exacerbates or 
abates over time, then clearly time is a crucial factor. In the latter case, indices addressing 
temporal order might be used to characterize such a dynamic process. 


2. What is an appropriate method for data collection? Ambulatory assessment is 
an especially suitable method for investigating dynamic and unstable constructs because 
it enables tracking of the ebb and flow of the variable of interest over time. This can be 
thought of as capturing the film of ongoing life rather than a snapshot of it. Retrospec- 
tive questionnaires and interviews are limited in their ability truly to assess instability as 
experienced over time in everyday life. 

When time is considered and integrated into the research question, into the sampling 
method, and into the data-analytic strategy, the question of timing (promptness) of the 
participant’s responses is all the more critical. In this case, we recommend procedures 
that can evaluate timely compliance with prompts, such as electronic diary devices that
time-stamp responses (for discussion regarding electronic vs. paper-pencil diaries, see 
Green, Rafaeli, Bolger, Shrout, & Reis, 2006; Stone et al., 2002; Tennen, Affleck, Coyne, 
Larsen, & DeLongis, 2006). 


3. Which sampling protocol is recommended? In line with recent recommendations 
(Bolger et al., 2003; Ebner-Priemer & Sawitzki, 2007; Fahrenberg & Myrtek, 2001; Reis 
& Gable, 2000; Shiffman, 2007), we propose that the sampling protocol must follow 
from the research question. We suggest the application of event-contingent recording 
for rare or episodic events (e.g., social interactions), whereas time-contingent recording 
is better suited to continuous phenomena (e.g., affect). Combined sampling protocols
(Shiffman, 2007) are ideal for investigating the interplay between events and continuous
phenomena.


4. What is an appropriate time-based design? The time-based design, which is cru- 
cial for research in ambulatory assessment, is defined by the number of assessment points, 
the intervals between assessment points, and the total length of the assessment period. 
Again, the time-based design must fit the temporal dynamics of the processes of interest 
to avoid difficulties and misleading results. Unfortunately, the temporal dynamics of most 
psychological variables and processes are unknown. In this case, we recommend that the 
researcher begin by “oversampling” in a pilot study. This is in line with Warner (1998), 
who suggested initial sampling at the highest frequency, and with Bolger and colleagues 
(2003), who advised researchers to err on the side of shorter intervals. On the other hand, 
the number of self-reports and the time interval between self-reports are limited by the 
increased burden on the participants. If compliance decreases over the total assessment 
period, as evidenced, for example, by an increase in missing data, the total assessment 
period may be too long, or the number of prompted self-reports too high. A graphical 
inspection of the data presented in Figures 23.1 and 23.2 provides a good overview of the 
spread of data over the whole assessment and between intervals. 


5. How to choose an appropriate analytic strategy? According to our conceptualiza- 
tion, instability comprises three general components: amplitude, frequency, and temporal 
dependency. A failure to consider any one of the components in the analytic strategy can 
lead to misleading conclusions. Temporal dependency in particular is vitally important, 
but many of the indices used frequently for analysis of “instability” (e.g., standard devia- 
tions) have failed to address adequately the temporal sequencing of states. Because the 
mathematical basis of more complex indices of instability might be unknown, we recom- 
mend evaluating which components of instability the indices accurately reflect. This can 
easily be done using artificial time series, as shown earlier. But there are conditions under 
which time-sensitive indices may not be the best analytic choice, when, for example, the 
research question does not focus on dynamic processes, when the sampling protocol calls 
for relatively infrequent assessments, or when the dataset contains much missing data. 
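
The check with artificial time series can be sketched as follows. This Python example is an illustration only: the two series and the helper name `mssd` are assumptions made for the sketch. Both series contain exactly the same values in a different temporal order, so an index that ignores temporal order (the standard deviation) cannot tell them apart, whereas an index built on successive differences can.

```python
from statistics import pstdev


def mssd(series):
    """Mean squared successive difference of a complete time series."""
    return sum((b - a) ** 2 for a, b in zip(series, series[1:])) / (len(series) - 1)


# Two artificial series with identical values but different ordering.
smooth = [1, 2, 3, 4, 5, 4, 3, 2]  # gradual rise and fall
jumpy = [1, 5, 2, 4, 3, 4, 2, 3]   # same values, abrupt changes

sd_smooth, sd_jumpy = pstdev(smooth), pstdev(jumpy)
mssd_smooth, mssd_jumpy = mssd(smooth), mssd(jumpy)
print(sd_smooth, sd_jumpy, mssd_smooth, mssd_jumpy)
```

The standard deviations are identical, while the MSSD of the abruptly changing series is roughly five times that of the gradually changing one; only the second index reflects temporal dependency.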


We hope that researchers, taking these basic recommendations into account, can 
further improve our understanding of the dynamics of everyday life. Even though it can 
be burdensome to iteratively search for appropriate analytic techniques and assessment 
strategies for investigating instability, we are convinced that it is “worth the trouble.” 
CHAPTER 24 


Modeling Nonlinear Dynamics 
in Intraindividual Variability 


PASCAL R. DEBOECK 


This chapter introduces one approach to describing constructs that consist of easily
reversed, short-term changes: data that are often dismissed as “error” or too complex
to model. Short-term, relatively reversible changes in constructs have been described
as intraindividual variability by Nesselroade (1984, 1991) and as temporal instability 
by Ebner-Priemer and Trull (Chapter 23, this volume). This chapter begins with tools 
for describing rates of change, the idea of calculating rates of change on many small seg- 
ments of a time series, and, finally, an example of how the current state of a variable can 
be related to how it is changing. Differential equation modeling holds promise for the 
modeling of intraindividual variability because of the way it allows researchers to under- 
stand the relationships governing momentary changes in a time series. Two methods for 
fitting differential equation models are described, using ordinary linear regression and 
structural equation modeling. The chapter ends with code for each of the two methods, 
written for the free statistical program R, and analysis of a sample time series. 

As an example, imagine that I have completed a positive affect measure every day 
for the last month. I have not experienced any dramatic events during this month, and 
my days have generally followed a typical routine. Examining my data, one would probably
not expect there to be a linear trend over time, but one would observe lots of variability,
as in Figure 24.1. This variability seems so complicated that it is often given a
special name, “error variance,” and then dismissed. But we know that these fluctuations
are not completely error: Today I am a little like I was yesterday, and a little bit
like I will be tomorrow. If this statement is true to any degree, even if only partially 
and with exceptions, it suggests that observations of my positive affect over time will 
be correlated. For these observations truly to consist of error, most statistical methods 
assume that every observation is completely independent from all others. If I am at all 
self-similar on adjoining days or due to a weekly cycle, the assumption of independence 
is violated. 
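
One simple way to check for this kind of self-similarity is the lag-1 autocorrelation, that is, the correlation of each observation with the one that follows it. The Python sketch below is illustrative only: the function name and the daily scores are hypothetical, and the estimator shown is one common convention among several.

```python
def lag1_autocorrelation(series):
    """Correlation between each observation and the next one.

    Uses the overall mean and the overall sum of squares, a common
    convention; the series must not be constant.
    """
    n = len(series)
    mean = sum(series) / n
    numerator = sum(
        (series[i] - mean) * (series[i + 1] - mean) for i in range(n - 1)
    )
    denominator = sum((value - mean) ** 2 for value in series)
    return numerator / denominator


# Hypothetical daily positive affect scores with gradual day-to-day change.
daily_scores = [40, 41, 42, 42, 41, 40, 39, 38, 39, 40]
r_daily = lag1_autocorrelation(daily_scores)

# A series that flips direction at every step, for contrast.
alternating = [1, -1] * 5
r_alternating = lag1_autocorrelation(alternating)
print(r_daily, r_alternating)
```

A clearly positive value, as for the hypothetical daily scores, indicates that adjoining days resemble each other, which is exactly the situation in which the independence assumption fails.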

FIGURE 24.1. An example of data similar to daily measures of Positive Affect (axes: Day and
Positive Affect Score).

The modeling of such data might seem a complex endeavor, perhaps because the
data look complex, or because methods for doing so are not a regular part of statistical
training at this time. In this chapter I present a method that may help to extract information
from time series consisting of intraindividual variability (for another approach, see
Ebner-Priemer & Trull, Chapter 23, this volume). In doing so we begin to look at data 
from a different perspective—one in which the key differences between people are not in 
mean levels, but in how people are changing. The method introduced in this chapter is 
differential equation modeling. These words often seem to elicit a response that, if paired 
with background music, would be well matched with the theme from The Omen. How- 
ever, there are applications of differential equation modeling that are no more compli- 
cated than ordinary linear regression or a simple structural equation model. I begin with 
two key ideas that can dramatically change how we think about modeling an individual’s 
time series: derivatives and embedding. 


Derivatives 


Derivatives can be used to describe how one variable is changing with respect to changes 
in another variable. Imagine watching someone running along the beach. As this person 
runs, information is collected on two variables: the person’s position, and when each 
position occurs (time). (Being diligent researchers, prior to the person’s arrival we marked 
equal intervals on the beach so that we could gauge his or her position.) One thing we 
might be curious about is how fast that person is running: We might be interested in the 
amount of change in the person’s position with respect to the amount of change in time. 
We could calculate a change in position ($p_2 - p_1$) and divide it by the change in time
($t_2 - t_1$). The person’s speed that we have calculated is an estimate of the first derivative.
Most of us were introduced to the first derivative very early in our education when we
calculated “rise over run,” or the change in y over (i.e., with respect to) a change in x.

Interested in whether the person is getting winded, we might decide to calculate the
person’s speed, $v_1 = (p_2 - p_1)/(t_2 - t_1)$, wait a few seconds, then calculate his or her speed
again, $v_2 = (p_4 - p_3)/(t_4 - t_3)$. If we were to subtract the two speeds ($v_2 - v_1$) and divide by
the time separating the two speed estimates ($t_{3.5} - t_{1.5}$; i.e., the centers of the previous time
intervals), we would get an estimate of how quickly this person’s speed was changing with
respect to time, that is, whether the person was accelerating or decelerating. The change
in the first derivative with respect to time (acceleration) is the second derivative. The one
derivative we often observe, rather than calculate, is the zeroth derivative, which is the
person’s position ($p_1$) at some time ($t_1$).
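
The beach example can be worked through numerically. In the Python sketch below, the positions, times, and function name are hypothetical values chosen for illustration; the calculation is the simple difference quotient described above, applied once to positions (speed) and once to speeds (acceleration).

```python
def rate_of_change(y_start, y_end, t_start, t_end):
    """Change in y divided by the change in t ("rise over run")."""
    return (y_end - y_start) / (t_end - t_start)


# Hypothetical runner: positions in meters observed at times in seconds.
times = [0.0, 2.0, 10.0, 12.0]
positions = [0.0, 10.0, 46.0, 54.0]

# First derivative (speed), estimated on two separate intervals.
v1 = rate_of_change(positions[0], positions[1], times[0], times[1])
v2 = rate_of_change(positions[2], positions[3], times[2], times[3])

# Each speed estimate is located at the center of its time interval.
center1 = (times[0] + times[1]) / 2
center2 = (times[2] + times[3]) / 2

# Second derivative (acceleration): change in speed over change in time.
acceleration = rate_of_change(v1, v2, center1, center2)
print(v1, v2, acceleration)
```

Here the runner slows from 5 to 4 meters per second over the 10 seconds separating the interval centers, so the estimated acceleration is negative.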

Derivatives can be used to examine the change of any two variables with respect to 
each other, but frequently we are interested in changes of some measured variable with 
respect to time. If we imagine a quadratic growth curve, we can see measurements on 
a psychological construct in terms of the zeroth, first, and second derivatives. At some 
point on the curve, perhaps the middle, is an intercept—the score at a given point in time, 
or we could also call it the zeroth derivative. But looking before and after that point, the 
scores are changing. At that moment the scores have a particular speed with which they 
are changing—a tangent line would indicate this speed; this is the first derivative. But 
we can also imagine that the scores follow a curved line—the speed is changing with 
respect to time. This acceleration, or the second derivative, can be seen in how the slope 
changes just before and after the point being examined. So at any point on a quadratic 
growth curve we have the score at that time (zeroth derivative), and that score will have 
some speed at which it is changing (first derivative), and the speed may be increasing or 
decreasing (second derivative). The intercept (score at some time), linear slope (speed), 
and quadratic slope (curvature) in growth curve modeling are similar to derivatives. 
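
The parallel can be written out explicitly. As a brief worked sketch, assume a quadratic growth curve with intercept $\beta_0$, linear slope $\beta_1$, and quadratic slope $\beta_2$ (symbols chosen here for illustration):

```latex
\begin{aligned}
x(t) &= \beta_0 + \beta_1 t + \beta_2 t^2 && \text{(zeroth derivative: the score)}\\
\frac{dx}{dt} &= \beta_1 + 2\beta_2 t && \text{(first derivative: speed)}\\
\frac{d^2x}{dt^2} &= 2\beta_2 && \text{(second derivative: acceleration)}
\end{aligned}
```

At $t = 0$ the three derivatives equal $\beta_0$, $\beta_1$, and $2\beta_2$, so the growth curve coefficients correspond to the derivatives up to a constant factor.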


Embedding 


The second key idea is that of embedding or, more commonly, embedding dimension. 
The literature on embedding dimension is an extensive one that frequently coincides with 
ideas related to linear and nonlinear dynamical systems. One of the key ideas related to 
embedding dimension regards the number of dimensions required to represent a time 
series. The study of embedding dimension can lead to some complex ideas, more compli- 
cated than needed for our purposes. We can glean the two primary points needed for our 
purposes by understanding how an embedded matrix is created and what the embedding 
dimension determines. 

Creation of an embedded matrix consists of combining replicates of a time series that 
have been offset in time. Imagine that we have a time series with eight observations: $x_1$,
$x_2$, ..., $x_8$ (the leftmost set of boxes in Figure 24.2). For simplicity let us assume that these
observations occur at equally spaced intervals, where $\Delta t$ is the time between measurements.
We could then copy this time series some number of times, placing the replicates
next to each other but offsetting them by some amount of time. In Figure 24.2 (center)
four replicates of the time series have been created and placed next to each other, offset
by $\Delta t$ (i.e., one observation). After trimming incomplete rows, the result is the embedded
matrix on the right side of Figure 24.2. 
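
The construction can be sketched in a few lines of Python. This is an illustration rather than the code provided later in the chapter: the function name `embed`, its arguments, and the orientation of the result (each row holding successive observations) are assumptions made for the sketch.

```python
def embed(series, dimension, tau=1):
    """Build an embedded matrix as a list of rows.

    dimension: number of columns (replicates of the time series).
    tau: number of observations by which adjacent columns are offset.
    Rows that would run past the end of the series are trimmed.
    """
    if dimension < 1 or tau < 1:
        raise ValueError("dimension and tau must be positive integers")
    span = (dimension - 1) * tau
    if len(series) <= span:
        raise ValueError("time series too short for this embedding")
    return [
        [series[start + k * tau] for k in range(dimension)]
        for start in range(len(series) - span)
    ]


# Eight observations, labeled rather than numeric so the layout is visible.
observations = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]

embedded = embed(observations, dimension=4)
for row in embedded:
    print(row)

# The same series with three columns offset by two observations each.
embedded_wide = embed(observations, dimension=3, tau=2)
```

With eight observations and an embedding dimension of 4, five complete rows remain, running from (x1, x2, x3, x4) to (x5, x6, x7, x8).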

FIGURE 24.2. Example of taking a time series (left), embedding the time series (center), and
creating an embedded matrix (right). The embedding dimension of this matrix is 4.

We can now begin to understand how one goes about selecting an appropriate number
of columns (time series replicates), which corresponds to the embedding dimension.
One consideration is whether the embedding dimension is large enough to reconstruct the
dynamics of the system of interest (are two columns sufficient, or does one need six
columns?). To understand this question, think about a single row of the embedded matrix.
Each row is used to estimate derivatives at a particular time. If we have a long time series, 
such as in Figure 24.3, each row of the embedded matrix represents a small segment of 
this time series (figure insert). From each of these small segments we estimate derivatives. 
This process is akin to fitting many small growth curves along a time series. With growth 
curves, we know that it takes at least two observations to estimate a line (three, if we 
want to estimate error), or three observations to estimate the quadratic component (four, 
if we want to estimate error). In the same way, the embedding dimension is related to the 
maximum derivative that can be estimated. For example, with an embedding dimension 
of two (i.e., two observations per row) it would be impossible to estimate the second 
derivative (curvature, related to the quadratic term). Therefore, when we are interested in 
fitting a specific model to a time series, the highest order of derivative in the model places 
a lower limit on the embedding dimension. 

In the other direction, we could increase the size of the time series segment used to 
estimate each derivative; this could be accomplished by either using a large embedding 
dimension or changing the time offset between adjacent columns (e.g., shift the columns 
in Figure 24.2 by two or more observations rather than one). As we imagine the window 
in Figure 24.3 growing larger, clear consequences emerge. First, a larger window allows 
us to examine more true score variance, which will allow us to get better estimates of the 
derivatives because the proportion of error variance will be smaller. However, another 
consequence is that if the window is too large, we will begin to average over change of 
interest: hills and valleys might differ little from straight lines with a large enough
window, neglecting the rich change that is occurring. In selecting an embedding dimension
we must consider not only the model we wish to fit (i.e., the maximum order of derivative
to be estimated) but also how our sampling rate (number of observations made per unit
time) compares to how quickly our system of interest is changing. One wants to get good
estimates of true score variance in each row, without averaging over change of interest.

FIGURE 24.3. This figure depicts the analysis of many small segments of a time series,
accomplished using derivatives and an embedded matrix.

An embedded matrix is a way to rearrange one’s data. However, the consequences of 
doing so are remarkable, as the embedded matrix moves models from fitting a single set 
of derivatives to an entire time series toward fitting derivatives to little segments of data 
throughout the entire time series. The derivative estimates, like a series of small growth 
curves, capture the relatively short-term fluctuations that occur in a time series. 


Differential Equation Models 


The previous sections have highlighted two concepts: derivatives and embedding. Deriva- 
tives are a tool to describe rates of change. Embedding is a tool used to recreate the 
dynamics of a time series using small segments. The combination gives us a way to 
describe the short-term fluctuations occurring across a time series. The questions then 
follow: What predicts the rate at which a person is changing at any given time? Why is it 
at some points the scores increase and at other times decrease? At some points accelerate 
and at other times decelerate? 

Answers to these questions can take many forms. Differential equation models try 
to answer these questions by examining relationships between derivatives from one or 
more time series. Any model that uses derivatives is called a differential equation model. 
For example, imagine that we are examining the daily positive affect scores of an indi- 
vidual over time. Over a short time, we might not expect there to be an overall trend in 
any particular direction (assuming there was not some sort of major life event). But the 
person might have an average or typical level of positive affect around which there are
daily fluctuations. We might postulate that there is a certain amount of self-regulation in
these fluctuations, such that if a person’s affect gets far from his or her typical level, then 
the person is likely to gravitate back toward his or her typical level. How do we put this 
into an equation? How do we write a differential equation model? 

If we imagine removing the “typical” level (equilibrium) from an individual’s time 
series, we would see fluctuation around a value of zero, as in Figure 24.4. When positive 
affect gets far above equilibrium, there is a “pullback” to equilibrium (gray arrows). That 
“pullback” can be described as a change in the first derivative. When a high score occurs, 
rather than continuing at its current speed upward, the positive affect speed changes such 
that it slows down, then eventually reverses direction. This is a change in the first derivative
with respect to time, which was discussed earlier as the second derivative. So we could
think about a model where the zeroth derivative (score) is related to the second derivative
(change in speed), for example, $d^2x/dt^2 = \beta x$. This says that the second derivative ($d^2x/dt^2$)
is equal to the zeroth derivative ($x$) times a constant $\beta$. By using an embedded matrix,
this relationship is expected to occur for many small segments in the time series. As the
model for positive affect has been described, one would expect $\beta$ to be negative, because
positive scores (i.e., above zero) would be related to acceleration in the negative direction,
and negative scores would be related to acceleration in a positive direction. The relationship
also dictates that the magnitude of the acceleration will depend on the magnitude of
$x$. When $\beta$ is large in magnitude, the return to equilibrium will be very rapid, while a $\beta$
closer to zero would suggest a much slower return to equilibrium.

This model is well known in the physical sciences as the model for a linear oscillator, 
or pendulum. Sometimes an additional term is included:

$$\frac{d^2x}{dt^2} = \beta_0 x + \beta_1 \frac{dx}{dt} \qquad (1)$$

This additional term allows for a relationship between the second and first derivatives
($dx/dt$). The term describing the relationship, $\beta_1$, is the amount of damping. This term
allows the amplitude of a pendulum (how far it swings to the right and left) to increase or
decrease over time. Real pendulums frequently experience a decrease in amplitude over
time due to friction. The rate at which the pendulum returns to equilibrium, $\beta_0$, is related to
the frequency at which the pendulum oscillates. Please note that in psychological publications,
one will often see $\beta_0$ represented with $\eta$ and $\beta_1$ with $\zeta$; the different symbols
convey the same information (e.g., Boker, Leibenluft, Deboeck, Virk, & Postolache, 2008;
Montpetit, Bergeman, Deboeck, Tiberio, & Boker, 2010).

FIGURE 24.4. A depiction of the idea of self-regulation. Large departures from equilibrium
(horizontal dashed line) are more likely to experience forces that will move the score back toward
the equilibrium (gray arrows).

One particularly interesting feature of Equation 1 is that the construct being exam- 
ined is not modeled as a function of time. Often with longitudinal data social scientists 
are accustomed to seeing the dependent variable written as a function of time t, perhaps 
something such as $y = \beta_0 + \beta_1 t$. This is not the case with differential equation models.
Consequently, models such as Equation 1 have no “average” or “prototypical” trajec- 
tory over time—this model does not provide the researcher with the specification of a 
line that can be drawn through the points of a time series. That is, the model specifies 
relationships between the current state of one or more variables and how those variables 
are changing, but the model does not specify a specific trajectory over time. Equation 1 is 
the same as that for a damped linear oscillator, but it does not imply that we expect the 
construct to change in some perfectly cyclic or oscillating pattern; it only conveys that if 
$\beta_0$ is negative, then as one departs from equilibrium there is some tendency to return to
equilibrium. 
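
To see what the relationship in Equation 1 implies for one idealized, noise-free case, the model can be stepped forward numerically. The Python sketch below is an illustration under stated assumptions: the function name, parameter values, starting values, and step size are hypothetical, and the semi-implicit Euler scheme is simply one convenient way to integrate the equation. Real data would also contain disturbances that this sketch omits.

```python
def simulate_oscillator(beta0, beta1, x0, v0, dt=0.01, steps=2000):
    """Step d2x/dt2 = beta0 * x + beta1 * dx/dt forward in time.

    Uses semi-implicit Euler integration: the speed is updated from the
    current acceleration, then the score is updated from the new speed.
    Returns the list of scores, including the starting value.
    """
    x, v = x0, v0
    scores = [x]
    for _ in range(steps):
        acceleration = beta0 * x + beta1 * v  # Equation 1
        v += acceleration * dt
        x += v * dt
        scores.append(x)
    return scores


# Negative beta0 (pull toward equilibrium) with negative beta1 (damping),
# starting one unit above equilibrium with zero speed.
damped = simulate_oscillator(beta0=-1.0, beta1=-0.5, x0=1.0, v0=0.0)

# Positive beta0: departures from equilibrium are amplified instead.
runaway = simulate_oscillator(beta0=0.5, beta1=0.0, x0=1.0, v0=0.0, steps=500)

print(damped[-1], min(damped), runaway[-1])
```

With a negative $\beta_0$ and a negative $\beta_1$, the score overshoots equilibrium, swings back, and settles near zero; with a positive $\beta_0$ the score moves ever farther from equilibrium.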


Implementing 


The previous sections introduced the ideas of derivatives, embedding, and differential 
equation modeling; that is, they provided tools for describing rates of change (deriva- 
tives); the idea of calculating rates of change on many small segments of a time series 
(embedding); and, finally, an example of how the current state of a variable can be related 
to how it is changing (differential equation modeling). Now that the concepts have been 
presented, this chapter addresses how a differential equation model can be fit to a time 
series. Two different approaches are presented—one based on creating observed deriva- 
tive estimates, the other based on creating latent derivative estimates. Both methods use 
an embedded matrix X to estimate the derivatives in matrix Y using a set of constants, 
defined in the weight matrix W. In this section I discuss the implementation of each of 
the two methods. In the section that follows, both methods are applied to a sample time 
series. 


Observed Derivative Estimates 


One method for the creation of observed derivative estimates is using the method of gen- 
eralized orthogonal local derivative (GOLD) estimates (Deboeck, 2010). In this method, 
the embedded matrix (X) is multiplied by a weight matrix (W) to solve for a matrix of 
derivative estimates (Y). GOLD estimates stem from the idea of estimating orthogonal 
trends, but this idea is applied to the individual rows of an embedded matrix rather than 
an entire time series. The application of GOLD can be described using four steps: (1) 
embed the time series, (2) calculate W, (3) multiply XW using matrix multiplication, and 
(4) use the resulting estimates Y to fit a differential equation model.

Step 1 is described in the section “Embedding,” which highlights several things
that should be considered when choosing the embedding dimension. The calculation of 
the weight matrix W is introduced here. Later in this chapter, code is provided to help 
researchers calculate this matrix. If we consider the value of a score at some time to be 
equal to the score at some previous time, the rate at which the score is changing, and the 
rate at which the scores are accelerating, we would produce the formula

$$x_t = x_0 + \frac{dx}{dt}\,t + \frac{1}{2}\,\frac{d^2x}{dt^2}\,t^2 \qquad (2)$$

This equation says that the score at some time ($x_t$) is equal to the score at a previous time
($x_0$), plus the speed at which scores are changing multiplied by the elapsed time ($\frac{dx}{dt}\,t$),
plus one-half of the rate at which the scores are accelerating multiplied by the elapsed
time squared ($\frac{1}{2}\,\frac{d^2x}{dt^2}\,t^2$). If we think about the formula for a polynomial trend,

$$x_t = \beta_0 + \beta_1 t + \beta_2 t^2 \qquad (3)$$


there appear to be parallels between estimating polynomial coefficients and Equation 
2. In fact, as discussed before, derivative estimates are related to linear and quadratic 
trends—just applied to many small pieces of a time series. It is precisely this relationship 
that is used to calculate GOLD estimates. 

For completeness, the three paragraphs that follow describe the calculation of the 
GOLD weight matrix ($W_{GOLD}$). It is not critical to understand the content of these paragraphs,
but it is important to understand that the GOLD weight matrix is designed
to estimate a series of orthogonal derivatives much like one could estimate a series of 
orthogonal polynomial trends from a time series. (Readers less interested in the deriva- 
tion of the weight matrix can skip the next three paragraphs.) 

The GOLD weight matrix $W_{GOLD}$ is calculated using two matrices: $D$ and $\Xi$. The
matrix $D$ is a diagonal matrix of scaling constants that converts the polynomial estimates
to derivative estimates; as can be seen in Equations 2 and 3, $\beta_2$ and $d^2x/dt^2$ differ by a
factor of 2, so the $D$ matrix supplies the scaling constant ½ for the second-order term.
More generally, $D$ is a square matrix with diagonal elements equal to $1/q!$, where $q$ = 0,
1, ..., up to the highest order of derivative being estimated, and ! is standard notation
for a factorial.

The $\Xi$ matrix calculation is based on a formula for the estimation of orthogonal
polynomials. Each row of $\Xi$ is calculated in succession (starting from $\xi_0 = t^0$, a row
of ones) using the equation

$$\xi_q = t^q - \sum_{p=0}^{q-1} \left[ \frac{\sum_i \xi_{p,i}\, t_i^q}{\sum_i \xi_{p,i}^2} \right] \xi_p ,$$

where $t$ are the times of observation (e.g., $t$ = {1, 2, 3,
4, 5}) for the measurements in a particular row of X, $q$ ($q$ = 1, ..., $m$) represents the order
of the polynomial, and $i$ takes on integer values from 1 through the number of observations
used to estimate a particular trend (adapted from Narula, 1979). This equation is
used to produce a set of coefficients that can be used to estimate orthogonal polynomials.

When Y is multiplied by $D$ and $\Xi$, the result is X. Solving for Y, so that observed
derivative estimates are produced,

$$X = YD\Xi \qquad (4)$$

$$Y = X(D\Xi)'[(D\Xi)(D\Xi)']^{-1} = XW_{GOLD} \qquad (5)$$

As an example, if the embedding dimension of X is equal to 3, and one is interested in
calculating up to the second derivative (three orders: zeroth, first, and second), then

$$\Xi = \begin{bmatrix} 1.0 & 1.0 & 1.0 \\ -1\tau\Delta & 0.0 & 1\tau\Delta \\ (1/3)\tau^2\Delta^2 & (-2/3)\tau^2\Delta^2 & (1/3)\tau^2\Delta^2 \end{bmatrix}, \qquad D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1/2 \end{bmatrix}$$

and

$$W_{GOLD} = \begin{bmatrix} 1/3 & -1/(2\tau\Delta) & 1/(\tau^2\Delta^2) \\ 1/3 & 0 & -2/(\tau^2\Delta^2) \\ 1/3 & 1/(2\tau\Delta) & 1/(\tau^2\Delta^2) \end{bmatrix}$$

where $\tau$ is the number of observations by which adjacent columns of the embedded matrix
are offset, and $\Delta$ is the time between observations. Code is provided in the “Example” section
to automatically calculate $W_{GOLD}$.

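This brings us to Step 3, where X and $W_{GOLD}$ are multiplied to form derivative estimates.
Consider an embedded matrix with a single row equal to {4, 5, 6}. Assuming $\tau = 1$
and $\Delta = 1$, then

$$Y = XW_{GOLD} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1/3 & -1/2 & 1 \\ 1/3 & 0 & -2 \\ 1/3 & 1/2 & 1 \end{bmatrix} = \begin{bmatrix} \left(\tfrac{4}{3}+\tfrac{5}{3}+\tfrac{6}{3}\right) & \left(-\tfrac{4}{2}+0+\tfrac{6}{2}\right) & (4-10+6) \end{bmatrix} = \begin{bmatrix} 5 & 1 & 0 \end{bmatrix} \qquad (6)$$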


The estimate for the zeroth derivative (score) of the series {4, 5, 6} is 5, the estimate of 
the first derivative (rate of change) is 1 unit for every unit of time, and the estimate of the 
second derivative (change in rate of change) is 0. 
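
The weight matrix and the worked example can be checked with a short Python sketch. This is not the R code provided in the “Example” section; the function names are hypothetical, and the sketch relies on two assumptions: the orthogonal polynomial rows are built by successively removing lower-order trends from powers of the observation times, and, because those rows are mutually orthogonal, the matrix inverse in Equation 5 reduces to dividing each row by its sum of squares and multiplying by q!. Exact fractions are used so the weights can be compared with the matrix given above.

```python
from fractions import Fraction
from math import factorial


def gold_weights(times, max_order):
    """Weights that turn one row of an embedded matrix into derivative estimates.

    times: observation times for the columns of one row (e.g., [-1, 0, 1]).
    max_order: highest derivative to estimate; must be below len(times).
    Returns a len(times) x (max_order + 1) matrix as a list of rows.
    """
    t = [Fraction(value) for value in times]
    xi = []  # rows of orthogonal polynomial coefficients
    for q in range(max_order + 1):
        powers = [value ** q for value in t]
        row = list(powers)
        for previous in xi:  # remove every lower-order trend
            coef = sum(p * w for p, w in zip(previous, powers)) / sum(
                p * p for p in previous
            )
            row = [r - coef * p for r, p in zip(row, previous)]
        xi.append(row)
    norms = [sum(value * value for value in row) for row in xi]
    return [
        [factorial(q) * xi[q][i] / norms[q] for q in range(max_order + 1)]
        for i in range(len(t))
    ]


def gold_estimates(row, weights):
    """Multiply one embedded row by the weight matrix (Y = XW)."""
    return [
        sum(x * w[q] for x, w in zip(row, weights))
        for q in range(len(weights[0]))
    ]


weights = gold_weights([-1, 0, 1], max_order=2)  # tau = 1, Delta = 1
estimates = gold_estimates([4, 5, 6], weights)
print(weights)
print(estimates)
```

For the row {4, 5, 6} the sketch reproduces the estimates given in the text: a zeroth derivative of 5, a first derivative of 1, and a second derivative of 0.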

If the time series is long, then X will have many rows and Y will have many sets of 
derivative estimates. The first, second, and third columns of Y will correspond to the 
zeroth, first, and second derivative estimates. With the estimated derivatives, differential 
equation models can be fit using common regression analyses. For example, one could 
enter the third column of Y as a dependent variable and the first two columns of Y as pre- 
dictors (i.e., Equation 1) into any program that performs regression. One could also use 
multilevel modeling software to model multiple individuals and to examine predictors of 
the differential equation model parameters. 
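
The regression step can be sketched as follows, again in Python rather than R. The function name and the derivative values are hypothetical, and the sketch assumes that equilibrium has already been removed from the series, so that Equation 1 is fit without an intercept by solving the two normal equations directly.

```python
def fit_oscillator(x, dx, d2x):
    """Least-squares estimates of beta0 and beta1 in Equation 1.

    x, dx, d2x: zeroth, first, and second derivative estimates
    (the three columns of Y). No intercept is estimated.
    """
    sxx = sum(a * a for a in x)
    svv = sum(b * b for b in dx)
    sxv = sum(a * b for a, b in zip(x, dx))
    sxy = sum(a * c for a, c in zip(x, d2x))
    svy = sum(b * c for b, c in zip(dx, d2x))
    det = sxx * svv - sxv ** 2
    if det == 0:
        raise ValueError("predictors are collinear; estimates are not unique")
    beta0 = (sxy * svv - sxv * svy) / det
    beta1 = (sxx * svy - sxv * sxy) / det
    return beta0, beta1


# Hypothetical derivative estimates that follow Equation 1 exactly
# with beta0 = -0.5 and beta1 = -0.1.
scores = [1.0, 0.5, -0.2, -0.8, 0.3]
speeds = [0.2, -0.6, -0.7, 0.1, 0.9]
accelerations = [-0.5 * a - 0.1 * b for a, b in zip(scores, speeds)]

beta0_hat, beta1_hat = fit_oscillator(scores, speeds, accelerations)
print(beta0_hat, beta1_hat)
```

Because the hypothetical accelerations were generated without error, the fit recovers the generating values; with real derivative estimates, a standard regression routine would additionally provide standard errors and fit statistics.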


Latent Derivative Estimates 


An alternative way to estimate derivatives is by creating latent derivative estimates (LDEs; 
Boker, Neale, & Rausch, 2004). In this method a set of latent scores Y is multiplied by a 
matrix of constants W, resulting in the observed values of the embedded matrix X; that 
is, LDEs begin with the equation $X = YW_{LDE}$. The application of LDEs is very similar to
the process for GOLD estimates: (1) embed the time series; (2) calculate W; (3) create a
structural equation model (SEM) with latent derivatives; and (4) fit a differential equation
model by drawing paths between the latent estimates. As before, Step 1 is described in
the section “Embedding.” 

The creation of the LDE weight matrix W differs from GOLD estimates. For LDE 
we consider the creation of the weight matrix for each derivative in turn—these weights 
correspond to the paths from the latent variables to the manifest variables, which consist 
of the individual columns of the embedded matrix. The zeroth derivative is calculated by 
setting all paths equal to one (like a latent intercept). The paths for the first derivative 
are selected so that they have a mean of zero and increment the same amount of time as 
adjacent observations in a single row of the embedded matrix. For example, if one had 
weekly measurements and an embedded matrix of dimension 5, and adjacent columns of 
the embedded matrix were offset by only one observation, one could specify the paths 
for the first derivative as {-14, -7, 0, 7, 14} if one wanted time scaled in days (note that 
the mean of these values is zero). Alternatively one could set the paths equal to {-2, -1, 
0, 1, 2}, which would scale the results in weeks rather than days. The second derivative 
paths are equal to the paths of the first derivative, squared and divided by 2. The second 
derivative paths, using the first derivative paths of {-14, -7, 0, 7, 14}, would be equal to
{(-14)²/2, (-7)²/2, 0²/2, 7²/2, 14²/2} = {98, 24.5, 0, 24.5, 98}. Higher-order derivatives can
also be calculated.¹
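
The path rules just described (ones for the zeroth derivative, mean-centered times for the
first, and powers divided by factorials for higher orders, as in Note 1) can be collected
in a few lines of R. This is only a sketch: the function name lde_loadings and its
arguments are hypothetical and do not come from the chapter.

```r
# Hypothetical helper (not from the chapter): build the fixed LDE paths for
# the zeroth through 'max_order' derivatives from the times of one embedded row.
lde_loadings <- function(times, max_order = 2) {
  centered <- times - mean(times)   # first-derivative paths must have mean zero
  orders <- 0:max_order
  # k-th derivative paths: centered time raised to the k-th power over k!
  W <- sapply(orders, function(k) centered^k / factorial(k))
  colnames(W) <- paste0("d", orders)
  W
}

lde_loadings(c(0, 7, 14, 21, 28))   # weekly data, time scaled in days
lde_loadings(0:4, max_order = 3)    # time scaled in weeks, plus a third derivative
```

Calling lde_loadings(1:4) gives the paths used for the four-column embedded matrix in the
Example section: ones, {-1.5, -0.5, 0.5, 1.5}, and {1.125, 0.125, 0.125, 1.125}.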

Once the paths to the latent variables are calculated, which make up the contents of
W_LDE, an SEM can be built. The final step is to include paths between the latent variables,
thereby completing the differential equation model. Figure 24.5 shows an example of the
differential equation model of a damped linear oscillator (Equation 1). In this figure X1,
X2, X3, X4, and X5 are each one column of an embedded matrix with embedding dimension 5.
The paths to the latent variables are fixed to constants as described in the previous
paragraph (values of W_LDE), thereby identifying which latent variables correspond to
which derivatives. The paths between the latent variables define the relationships for the
differential equation model; in Figure 24.5 the latent zeroth and first derivatives are used
to predict the latent second derivative estimates.²

FIGURE 24.5. Structural equation model for creating latent derivative estimates and fitting the
damped linear oscillator model. The paths to the columns of the embedded matrix (X1, X2, X3,
X4, and X5) are fixed, as they are with latent growth curve modeling. The parameters β0 and β1
correspond to frequency and damping parameters.


Example 


In this section GOLD and LDE are used to analyze the same set of data. Rather than 
select data that are inaccessible to the reader, a simulated set of data is used. These data 
are loosely based on daily measurements of positive affect, and are intended to represent 
a time series collected from a single individual. The code presented is for the statistical 
program R (R Development Core Team, 2011). In many cases the code presented is nei- 
ther the most efficient nor elegant from a programming perspective, but has been laid 
out in such a way that the individual steps are easier to digest. Many other programs 
can perform the necessary computations; R is used in this chapter because it is freely 
available on the internet, which is one less reason not to try this out.³ First we need to enter our
data into R: 


x <- c(40.1, 40.6, 41.9, 41.9, 40.7, 39.4, NA, 39.9, 37.4, 39.8, 
38.8, NA, 41.9, 39.3, 41.1, 40.4, 39.1, 39.0, 39.7, 39.6, 
39.9, 41.3, 41.2, 41.9, 39.3, 40.3, NA, 40.0, 39.4, 41.3, 
39.3, 42.1, 40.7, 40.9, 40.4, 39.4, 38.4, 39.1, NA, 40.7, 
40.7, 40.8, NA, 40.5, 42.9, 38.3, 38.4, 38.3, 40.1, 39.3) 


# The characters <- assign the values on the right to the 
# object name on the left 


After hitting return, a vector 50 observations in length is created, with missing val- 
ues indicated using the symbol “NA.” For more on reading data into R, the R website 
(www.r-project.org) includes a manual entitled R Data Import/Export. Please note the
use of the comment character # in the R code; anything written after the # character will 
not be run by R and has been included to explain the code. 

For the damped linear oscillator model one must set the equilibrium (i.e., the “typi- 
cal” state of the person) to zero, prior to embedding the time series. This requires careful 
theoretical consideration as to whether the equilibrium may be changing over time, or 
whether it is constant. For example, in a sample of recent widows, it was expected that 
there would be an upward trend in the equilibrium on a measure of emotional well- 
being (lower scores indicate poorer well-being) for the 90 days following the death of the 
woman’s husband (Bisconti, Bergeman, & Boker, 2004, 2006). However, measurements 
of positive affect across a series of more typical days might not show such a change in 
equilibrium, in which case removal of a constant equilibrium (e.g., the mean) might be 
sufficient. The selection one makes can impact the results of this analysis, so a carefully 
laid out argument is required for why the equilibrium was modeled in a particular way. 
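
One way to look for a linear or quadratic trend in the equilibrium is an ordinary
regression of the scores on time and time squared. The sketch below is an illustration
rather than the chapter's own code: it rebuilds the raw (uncentered) series entered above
so that it runs on its own, and the names day and trend are hypothetical.

```r
# Raw series from the example, repeated so this block is self-contained
x <- c(40.1, 40.6, 41.9, 41.9, 40.7, 39.4, NA, 39.9, 37.4, 39.8,
       38.8, NA, 41.9, 39.3, 41.1, 40.4, 39.1, 39.0, 39.7, 39.6,
       39.9, 41.3, 41.2, 41.9, 39.3, 40.3, NA, 40.0, 39.4, 41.3,
       39.3, 42.1, 40.7, 40.9, 40.4, 39.4, 38.4, 39.1, NA, 40.7,
       40.7, 40.8, NA, 40.5, 42.9, 38.3, 38.4, 38.3, 40.1, 39.3)

day <- seq_along(x)               # assumes one observation per day, in order
trend <- lm(x ~ day + I(day^2))   # rows with a missing x are dropped by default
summary(trend)$coefficients       # rows: intercept, linear term, quadratic term
```

If the linear and quadratic terms were clearly nonzero, the residuals of this model (rather
than deviations from the mean) would be a candidate for the detrended series.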
By fitting a quadratic model, we see no evidence in these data for a linear or quadratic
trend, but there is a significant intercept. We therefore begin our analysis by removing the 
mean from the data and creating an embedded matrix with embedding dimension 4, with 
columns offset by a single observation: 


x <- x - mean(x, na.rm=TRUE)
# na.rm will remove any missing values when calculating the mean

len <- length(x)  # the length of the vector x is saved as 'len'

embeddedX <- cbind(x[1:(len-3)], x[2:(len-2)], x[3:(len-1)], x[4:len])
# cbind will create a matrix by using the listed vectors as columns
# that should be bound together; square brackets after x allow for
# selection of specific values from the x vector.


Typing the variable name (embeddedX) into an R console and hitting return allows us to
see the embedded matrix. It is here that GOLD and LDE diverge. 
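
The cbind call above is written out for an embedding dimension of 4 and an offset of one
observation. A more general version might look like the following sketch; embed_series,
d, and tau are hypothetical names introduced here, and the short series is made up purely
to show that the helper reproduces the hand-written cbind.

```r
# Hypothetical helper (not from the chapter): time-delay embed a vector.
# d   = embedding dimension (number of columns)
# tau = offset, in observations, between adjacent columns
embed_series <- function(x, d = 4, tau = 1) {
  n_rows <- length(x) - (d - 1) * tau
  if (n_rows < 1) stop("series too short for this d and tau")
  cols <- lapply(0:(d - 1), function(j) x[seq_len(n_rows) + j * tau])
  do.call(cbind, cols)   # one column per lag, as in the chapter's cbind call
}

x_demo <- c(1.2, 0.4, NA, -0.6, -1.1, 0.3, 0.9)   # made-up values
embedded <- embed_series(x_demo, d = 4, tau = 1)
manual <- cbind(x_demo[1:4], x_demo[2:5], x_demo[3:6], x_demo[4:7])
identical(embedded, manual)
```

Missing values are simply carried into the embedded matrix, which matches what the
hand-written version does.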


Observed Derivative Estimates (GOLD) 


For GOLD, the next step is the calculation of the matrix W_GOLD. While not immediately
apparent, the following function implements the steps described earlier: 


ContrastsGOLD <- function(T, max) {
  Xi <- matrix(NA, max+1, length(T))
  for(r in 0:max) {
    Xi[r+1,] <- (T^r)
    if(r > 0) {
      for(p in 0:(r-1)) {
        Xi[r+1,] <- Xi[r+1,] -
          (Xi[p+1,]*(sum(Xi[p+1,]*(T^r)))/(sum(Xi[p+1,]*(T^p))))
      }
    }
  }
  DXi <- diag(1/factorial(c(0:max))) %*% Xi
  t(DXi) %*% solve(DXi %*% t(DXi))
}


To use this code to get W_GOLD, we must provide it two pieces of information: T and
max. The first of these, T, is a vector of values that is the same length as the embedding
dimension and has values that correspond to the spacing of time for any single row of the
embedded matrix. For example, we could say T is equal to c(1, 2, 3, 4): four values
(corresponding to an embedding dimension of 4) incrementing by one unit each time
(corresponding to daily measurements). For this function c(1, 2, 3, 4) is the same as
c(0, 1, 2, 3), or even c(-0.8, 0.2, 1.2, 2.2),⁴ because only the spacing between the values
matters. The other piece, max, is the maximum order of derivative that should be
calculated. To calculate a matrix for the zeroth, first, and second derivatives, one would
set max equal to 2. So, proceeding to get W_GOLD:


Wgold <- ContrastsGOLD(T=c(1,2,3,4),max=2) 


# get the weight matrix given T 
# (see above) and the max order of derivative desired (2) 


The next step is to use matrix multiplication (%*%) to multiply the embedded matrix X 
with the weight matrix W_GOLD:


DerivativeEstimates <- embeddedX %*% Wgold 
which produces the matrix of derivative estimates Y. We could then separate the columns
of this matrix and run a regression to fit the damped linear oscillator model: 


zeroth <- DerivativeEstimates[,1]
# the square brackets select the first column of DerivativeEstimates
# and assign it to an object named 'zeroth'
first <- DerivativeEstimates[,2]
second <- DerivativeEstimates[,3]

DLO <- lm(second ~ first + zeroth - 1)
# run a linear regression with second predicted by (~) first plus (+)
# zeroth, removing the intercept (-1). Save the result as 'DLO'
summary(DLO)  # Get a summary of the object 'DLO'


The function lm() is the R function for fitting a linear model (ordinary regression).
One thing to note is the use of -1 when fitting the damped linear oscillator model. This 
command instructs R not to estimate an intercept for the model, as the model does not 
include an intercept. The following is part of the output produced by the summary com- 
mand: 


Coefficients:
        Estimate Std. Error t value Pr(>|t|)
first    -0.1355     0.2730  -0.496   0.6240
zeroth   -0.4868     0.2180  -2.233   0.0348 *


The value following first is the estimate of β1, or the damping parameter; the value
following zeroth is the estimate of β0, or the frequency parameter. The damping parameter
is not significant, suggesting that it is not clear that the amplitude of these data is
changing over time; had it been significant, the negative value would suggest that the
amplitude was decreasing over time. The parameter β0 can be converted to a measure of
wavelength in order to get some idea as to how quickly these data are moving toward and
away from equilibrium. This can be calculated as follows:
$2\pi/\sqrt{-\beta_0} = 2\pi/\sqrt{0.487} \approx 9.0$, expressed in units of Δ, the time
between equally spaced observations (here, one day). This suggests that if perfect
oscillations were occurring, which is not required by this model, these data would take
9.0 days to complete a full cycle (peak to peak of a sine wave).
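
The wavelength conversion is a one-line calculation in R. The helper below is a sketch;
the name period_from_beta0 is hypothetical, and the conversion only makes sense when the
estimate of β0 is negative.

```r
# Hypothetical helper: convert the zeroth-derivative coefficient (beta0)
# into a wavelength, in the same time units used for T.
period_from_beta0 <- function(beta0) {
  if (beta0 >= 0) stop("beta0 must be negative for an oscillating system")
  2 * pi / sqrt(-beta0)
}

period_from_beta0(-0.4868)   # GOLD estimate of beta0 from the output above
```

With the GOLD estimate this comes to roughly 9 days, matching the value in the text.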


Latent Derivative Estimates 


For LDE, we first need to determine the paths from the latent derivatives to the observed 
columns of the embedded matrix. The paths for the zeroth derivative are all equal to one. 
For the first derivative we need a series that reflects the time spacing and has a mean of 
zero; for daily measurements we could select the values -1.5, -0.5, 0.5, and 1.5 (recall the
embedding dimension is 4). This series of values would scale time in units of days. The 
second derivative is calculated as the square of this series divided by two, so: 


Wlde <- cbind(c(1,1,1,1), c(-1.5,-0.5,0.5,1.5), c(1.125,0.125,0.125,1.125))


With the paths from the latent variables to the observed values of the embedded matrix 
specified, the remainder of the application of LDE just requires the specification of the 
model shown in Figure 24.5 (with different constants for the paths and an embedded 
matrix with only four columns instead of five), using the embedded matrix that was
calculated earlier. There are several SEM packages in R. The following code is presented 
using the package OpenMx (Boker et al., 2011). To understand this code, manuals for 
OpenMx are available through the package’s website openmx.psyc.virginia.edu. This 
model is specified using an asymmetric (A) and a symmetric (S) matrix, with the asym- 
metric matrix containing all single-headed arrows and the symmetric matrix containing 
all double-headed arrows (including variances). 


library(OpenMx)
# Get all of the functions placed in the package 'OpenMx'
ObsVar <- paste("x", c(1:4), sep="")
# These are the observed variable names
LntVar <- c("zeroth", "first", "second")  # Names of the latent variables
MatNames <- c(ObsVar, LntVar)
# Combine the names of the observed and latent variables

A <- mxMatrix(type="Full", nrow=length(MatNames), ncol=length(MatNames),
    free=FALSE, name="A")
# Create an OpenMx matrix of type Full (an asymmetric matrix), with
# the same number of rows and columns as observed & latent variables.
# All of the parameters in the matrix are initially not estimated
# (free=FALSE). OpenMx will call this matrix 'A' (name="A").
A@values[1:4,5:7] <- Wlde  # Set the paths from the latent variables to
# the observed variables equal to the weight matrix Wlde.
A@labels[7,5] <- "Beta0"
# Label the path from the zeroth to second derivative
A@free[7,5] <- TRUE
# Tell OpenMx "Beta0" is a parameter to be estimated
A@labels[7,6] <- "Beta1"
A@free[7,6] <- TRUE

S <- mxMatrix(type="Symm", nrow=length(MatNames), ncol=length(MatNames),
    free=FALSE, name="S")
# Create a symmetric matrix with the same dimensions as 'A'
diag(S@labels) <- c(paste("eObs", c(1:length(ObsVar)), sep=""),
    "eZeroth", "eFirst", "eSecond")
# Set the diagonal elements of 'S' equal to errors for the observed
# variables and the errors for the three latent derivatives.
diag(S@lbound[5:7,5:7]) <- 0
# Sets a lower bound for the variances of the latent variables.
diag(S@free) <- TRUE  # Estimate the diagonal elements of 'S'
S@free[5,6] <- TRUE; S@free[6,5] <- TRUE
S@labels[5,6] <- "CovFirstZeroth"; S@labels[6,5] <- "CovFirstZeroth"
# The last two lines tell OpenMx to estimate a covariance between
# the first and zeroth derivatives.
I <- mxMatrix(type="Iden", nrow=length(MatNames), name="I")
# Create an identity matrix 'I'.
F <- mxMatrix(type="Full", nrow=length(ObsVar), ncol=length(MatNames),
    free=FALSE, name="F")
diag(F@values[,1:4]) <- 1
# Create a filter matrix 'F' to select out the observed variables.

After specifying the matrices, including F and I matrices, which are used in the fitting
of the model,⁵ a model is created using the function mxModel and run using mxRun.
It should be noted that the present dataset has missing values, so covariance modeling
is not a good option. The present example uses mxFIMLObjective to use full information
maximum likelihood (FIML) to estimate the model parameters; a mean matrix M is
required for FIML estimation. The only other command of note is colnames, which adds
the observed variable names to the embedded matrix. The names of the variables must
match the variable names given in the OpenMx model (see where ObsVar was defined
earlier as the observed variable names).


M <- mxMatrix("Full", ncol=1, nrow=length(MatNames), name="M",
    labels=c("M1","M2","M3","M4",rep(NA,3)),
    free=c(rep(TRUE,4),rep(FALSE,3)))
# Create a matrix for the mean estimates. The first four values
# (observed variable means) are estimated; the last three values
# (latent derivative means) are not estimated
colnames(embeddedX) <- paste("x", c(1:4), sep="")
# This assigns the variable names 'x1' to 'x4' to the embedded
# matrix. The names in the data and 'ObsVar' must match.
DLOmodel <- mxModel("Model", A, S, I, F, M,
    mxAlgebra(F%*%solve(I-A)%*%S%*%t(solve(I-A))%*%t(F),
        name="ECov1", dimnames=list(ObsVar,ObsVar)),
    mxAlgebra(t(F%*%solve(I-A)%*%M), name="ExpM", dimnames=list(NA,ObsVar)),
    mxData(embeddedX, type="raw"), mxFIMLObjective("ECov1","ExpM"))
# Fit an mxModel named "Model," using the matrices 'A', 'S', 'I', 'F', and
# 'M'. The first mxAlgebra statement calculates the expected covariance
# matrix. The second mxAlgebra statement calculates the expected means. The
# mxData statement provides the data (embeddedX) and describes the type of
# data (type="raw"). The mxFIMLObjective statement says that we'll be using full
# information maximum likelihood to get parameter estimates.

DLOout <- mxRun(DLOmodel)
# This runs the model described in 'DLOmodel'
summary(DLOout)
# This provides a summary of the results saved as 'DLOout'


The summary of the output is extensive but includes values typical to many SEM outputs. 
The estimates of the parameters are as follows: 


free parameters:
             name matrix row col     Estimate Std.Error
1           Beta0      A   7   5 -0.277014375 0.5342730
2           Beta1      A   7   6 -0.181290213 1.1203406
3           eObs1      S   1   1  0.873431520 0.8409288
4           eObs2      S   2   2  1.133773780 0.3251057
5           eObs3      S   3   3  1.136678910 0.3278818
6           eObs4      S   4   4  1.013627047 0.7666319
7         eZeroth      S   5   5  0.322437491 0.2522938
8  CovFirstZeroth      S   5   6  0.004893195 0.0914608
9          eFirst      S   6   6  0.121652764 0.1462238
10        eSecond      S   7   7  0.000000000 0.3876799
11             M1      M   1   1  0.071044405 0.1801567
12             M2      M   2   1  0.014110769 0.1862936
13             M3      M   3   1  0.010772942 0.1869538
14             M4      M   4   1 -0.068207834 0.1809320


In this output one can see the estimates of the variances for the observed variables
(eObs1 to eObs4), the variance of the latent variables (eZeroth, eFirst, eSecond), the
covariance between the zeroth and first derivative (CovFirstZeroth), the means of the
four columns of the embedded matrix (M1 to M4), and the estimates of β0 and β1 (Beta0
and Beta1). If one were interested in testing the β0 and β1 parameters, it is suggested
to run the same model with each of these parameters constrained to zero. For β0, for
example, this can be accomplished by changing A@free[7,5] <- TRUE to A@free[7,5]
<- FALSE. This will tell OpenMx not to estimate the path from the zeroth derivative to
the second derivative; whether β0 should be a value other than zero can then be evaluated
using commonly used metrics such as the Akaike information criterion (AIC), the Bayesian
information criterion (BIC), or a likelihood ratio test. The interpretation of β0 and β1 in
terms of frequency and damping is the same as with the GOLD estimates, including the
calculation of wavelength for β0. The analyses do produce slightly different estimates in
this case (cycles that would last 9.0 and 11.9 days for GOLD and LDE, respectively);
this is a product of both the differing estimation methods and the relatively short, noisy
time series being used to estimate the parameters.


Observed or Latent? 


Many readers may be left wondering which is better: observed estimates or latent esti- 
mates? At the time of this writing, the selection between these two methods is often 
driven by how they will be used. Naturally the latent estimates could be used as part of a 
larger SEM model, or if one has multiple indicators of the same construct measured over 
time (i.e., embedded matrices for each of several measurements of the same construct). 
On the other hand, the multilevel modeling software employed by many researchers 
often requires observed values; so observed values are often used for models where one 
wishes to predict why certain people have differences in the differential equation model 
parameters. No doubt, as multilevel structural equation modeling (MSEM) tools become 
more refined, more direct comparisons of these two methods will be made. However, 
the sensitivity of some differential equation models to starting values, combined with the 
inclusion of multiple levels of analysis, seems to often be problematic for current MSEM 
software. (Given the increasing interest in MSEM, it seems doubtful that this problem 
will persist for an extended period of time.) 

One advantage of observed estimates, which will be more difficult to overcome in 
SEMs, is that GOLD can be used with unequally spaced data. While unequally spaced 
data require small changes to the code presented in the example, the GOLD equations 
presented in this chapter are the same as those necessary to work with unequally spaced 
data. This might be particularly useful, because many studies using ambulatory assess- 
ment also seem to be keen on using random sampling of a participant’s day. 
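
As a rough sketch of the "small changes" needed for unequally spaced data, the weight
matrix can be recomputed for each row of the embedded matrix from that row's own
observation times. ContrastsGOLD is repeated from the Example section so the block runs on
its own; the observation times and scores are made up for illustration, and the names
obs_time, score, and estimates are hypothetical.

```r
# ContrastsGOLD as given in the Example section
ContrastsGOLD <- function(T, max) {
  Xi <- matrix(NA, max+1, length(T))
  for(r in 0:max) {
    Xi[r+1,] <- (T^r)
    if(r > 0) {
      for(p in 0:(r-1)) {
        Xi[r+1,] <- Xi[r+1,] -
          (Xi[p+1,]*(sum(Xi[p+1,]*(T^r)))/(sum(Xi[p+1,]*(T^p))))
      }
    }
  }
  DXi <- diag(1/factorial(c(0:max))) %*% Xi
  t(DXi) %*% solve(DXi %*% t(DXi))
}

# Made-up, unequally spaced observation times (in days) and centered scores
obs_time <- c(0, 0.8, 2.1, 3.0, 4.2, 4.9, 6.1)
score    <- c(0.5, 0.9, 0.2, -0.4, -0.7, -0.1, 0.6)

d <- 4                                # embedding dimension
n_rows <- length(score) - d + 1
estimates <- t(sapply(seq_len(n_rows), function(i) {
  idx <- i:(i + d - 1)
  W <- ContrastsGOLD(T = obs_time[idx], max = 2)  # weights for this row's spacing
  score[idx] %*% W                                # zeroth, first, second estimates
}))
colnames(estimates) <- c("zeroth", "first", "second")
estimates
```

The resulting columns can be passed to lm() in the same way as the equally spaced
estimates.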
For both methods the example provided has analyzed a single, long, time series.
There is no reason that these same methods cannot be applied to shorter time series col- 
lected from multiple individuals. The key issue is not whether differential equation mod- 
els can be fit, but rather the consequences when interpreting results. Only with long time 
series can individual dynamics be modeled first, before consideration of why individuals 
vary in different ways. Particularly given that individual dynamics are likely to differ, it 
is important to model intraindividual data prior to modeling interindividual differences 
(Molenaar, 2004, 2008). With short time series, only a single set of parameters will 
be recoverable for the entire sample (or a select number of relatively large subsamples). 
Essentially one can fit a model that says, “If everyone changes in the same ways, then 
these are the estimated parameters for the differential equation model.” Such models 
might be required as first steps toward encouraging researchers to collect intraindividual 
data, but they also may be misleading because the group dynamics observed may corre- 
spond to few (or even none) of the individuals in the sample (e.g., Boker et al., 2008). 


Conclusion 


The modeling of intraindividual time series, particularly when there is not a clear linear 
or quadratic trend, can often seem like a complicated endeavor. This chapter has intro- 
duced one approach to describing data often dismissed as “error.” The combination of 
two ideas—derivatives and embedding—led to the possibility of modeling many small 
segments of change that occur in a time series through differential equation modeling. 
One possible differential equation model was introduced, the damped linear oscillator 
model; this model may be appropriate as a first approximation for psychological variables 
that may vary around some typical or equilibrium state. Two methods for estimating 
derivatives were then introduced and used to estimate the parameters of the damped lin- 
ear oscillator model in a sample time series. 

Differential equation models hold promise for the modeling of intraindividual vari- 
ability because of the way they allow researchers to understand relationships governing 
the momentary changes in a time series. But one must be careful not to be content with 
the end goal of fitting a differential equation, as was done in this chapter. Many models 
could fit the same data equally well. Some of the most interesting applications of differ- 
ential equation modeling, at least initially, will probably revolve around one of two ques- 
tions. The first is whether the derivatives of one construct are predicted by the derivatives 
of another construct; that is, does short-term level or change in one construct result in
changes to another construct? This requires the consideration of integrated systems of 
variables rather than just modeling a single construct over time. The second question is 
whether there are predictors of the differential equation model parameters; that is, are 
the ways in which people change over time related to other constructs such as personality 
traits? 

However, much more methodological development has yet to occur, which will hap- 
pen only as researchers in many areas begin to form nuanced theories about daily life. In 
many ways, to explore daily life is to sail into uncharted waters because in many areas 
there is little or no literature about change on a time scale as fine as daily measurement, 
the data are often very difficult and expensive to collect (and even then, we are often 
less practiced in collecting such data), and many of the statistics that ask questions of 
interest are in the process of development or have yet to be developed. But the realization
that there is a vast, uncharted part of the map laid before us also offers a tantalizing
possibility—discoveries that will bring a new spice to psychological research, a new flavor 
to the conceptualization of daily life that right now teeters on the edge of imagination. 


Notes 


1. Calculating higher-order derivatives follows the pattern expected based on calculus. For example, 
the third derivative (change in rate of acceleration, or jerk) can be estimated by raising the first 
derivative paths to the third power and dividing by six. 

2. As with other SEM models, one must consider identification of the model; for the damped linear 
oscillator model in Figure 24.5 the embedded matrix must have an embedding dimension of at least 
4 to be identified. 

3. The code in this chapter will be made available through the author’s website.

4. We could code time in hours, if we were interested in having results using a time scale other than
days. In this case, one could set T equal to c(0, 24, 48, 72).

5. Readers unfamiliar with the A, S, F, I matrix notation might find clarity by examining the introduc- 
tory pages of the Mx manual (Neale, Boker, Xie, & Maes, 2003)—the precursor to OpenMx. The 
manual is available through the website www.vcu.edu/mx. This is also known as the RAM approach 
(McArdle & McDonald, 1984). 


CHAPTER 25


Within-Person Factor Analysis 


Modeling How the Individual Fluctuates 
and Changes across Time 


ANNETTE BROSE 
NILAM RAM 


The Internet and the electronic devices many of us now carry as we go about our
daily lives provide a wide array of opportunities to collect more and more data from
more and more study participants. These remote and mobile technology data collection 
platforms can reduce the burdens that often fall on the individuals providing the data 
(e.g., travel to the laboratory), and on the research assistants entering and cleaning the 
data. Researchers increasingly have large datasets, with many repeated observations of 
many individuals on many variables—and many opportunities for analysis. These recent 
developments have tremendous implications for how psychological phenomena can be 
approached, both in principle and in practice. 

For almost half a century, one branch of research methodology has been promoting 
the individual as the proper unit of analysis when investigating psychological processes 
(e.g., Cattell, 1966a; Lamiell, 1981; Molenaar, 2004; Nesselroade, 2007; Ram & Ger- 
storf, 2009; Valsiner, 1986; see also Hamaker, Chapter 3, this volume). This “person- 
specific” school has repeatedly reminded psychologists that insights obtained in the 
examination of group differences cannot be interpreted as applying at the individual 
level (i.e., ecological fallacy; Robinson, 1950). For example, finding that individuals with 
higher levels of motivation are likely to be those with better performance outcomes does 
not necessarily translate to the within-person relations. It may be that for individuals 
with high levels of fear of failure, days with above-average motivation are also days char- 
acterized by less successful performance because of an unfortunately high level of arousal 
in the performance situation. Thus, relationships among variables may be moderated by 
individual characteristics—or even be completely person-specific or idiosyncratic. 

In the 1990s, multilevel modeling emerged as a promising data-analytic technique 
addressing the heterogeneity of individuals’ functioning across time (e.g., Snijders & 
Bosker, 1999; see also Nezlek, Chapter 20, this volume). In brief, multilevel models 
explicitly allow for between-person variation in within-person associations. Differences 
in the within-person associations among variables are modeled as deviations (random 
effects) around average sample-level coefficients (fixed effects). Coming back to the
example, within-person, across-time associations between motivation and performance can be
modeled as a function of between-person differences, such as fear of failure. This mul- 
tilevel modeling approach has been extremely useful in articulating and describing the 
heterogeneity of individual functioning—under the assumptions that all individuals are 
drawn from a single population that is fully described by population-level averages and 
normally distributed differences around those averages. 

In this chapter we review an alternative “bottom-up” approach for modeling within- 
person associations—within-person factor analysis (e.g., Cattell, Cattell, & Rhymer, 
1947; Jones & Nesselroade, 1990). This approach does not presuppose that all individu- 
als fall into a single population described by average population parameters and normally 
distributed between-person variation around those. Instead, modeling proceeds at the 
level of the individual and allows for true idiosyncrasy in the relations among variables. 
Each person can be different from all other persons. No assumptions are made about 
sample- or population-level distributions. As a bottom-up approach, the aim is first to 
model individual psychological functioning—one person at a time—and only later derive 
similarities and differences among persons (Nesselroade & Ford, 1985). In the following 
sections we review the conceptual and technical background for within-person factor 
analysis and provide a primer for the application of P-technique to multivariate time 
series data. Step-by-step procedures are illustrated using daily diary data obtained from 
five women over 100+ days (the Lebo Data; Lebo & Nesselroade, 1978). Finally, we point 
to some extensions in within-person factor-analytic models. 


Conceptual Background 


Factor analysis is a method for investigating the structure of a set of variables. The basic 
principle is data reduction, with the aim of representing the covariation among observed 
variables in terms of linear relations among a smaller number of abstract or latent vari- 
ables (Cattell, 1966a). The underlying idea is that if two or more characteristics covary, 
they may reflect a shared underlying construct. The patterns of covariation reveal the 
latent dimensions that lie beneath the measured qualities. A well-known example of the 
application of factor analysis is the Big Five model of personality (McCrae & Costa, 1997). 
According to this model, between-person differences in multiple observable behaviors 
are parsimoniously represented by interindividual differences in five personality factors 
(traits). The Big Five were, in part, developed through application of factor analysis to a 
multipersons × multivariables (× single occasion) matrix of scores, or what Cattell (1966a)
called R-data. As such, the model provides a parsimonious description of interindividual 
variation. As noted earlier, interpretation at the level of the individual is unwarranted. 
However, if individual-level representations are wanted, factor analysis can be applied to 
a multioccasions × multivariables (× single person) matrix of scores, or P-data. Applied to
this multivariate time series data, the P-technique factor model provides a parsimonious 
description of intraindividual variation—a parsimonious description of how states travel 
together or covary across time. The obtained structure can rightfully be interpreted at the 
level of the individual—albeit only for the particular individual being studied. 
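
To make the P-data layout concrete, the sketch below simulates 100 daily occasions of six
ratings for one hypothetical person and fits an exploratory two-factor model with the base
R function factanal. Everything here is assumed for illustration: the item names, the two
underlying states, and the independence of successive days (real diary data are usually
serially dependent, which this simple model ignores).

```r
set.seed(1)
n_days <- 100

# Two made-up latent states for one hypothetical person
energy  <- rnorm(n_days)
tension <- rnorm(n_days)

# P-data: rows are occasions (days), columns are variables, one person only
P_data <- cbind(
  active  = energy  + rnorm(n_days, sd = 0.5),
  lively  = energy  + rnorm(n_days, sd = 0.5),
  alert   = energy  + rnorm(n_days, sd = 0.5),
  tense   = tension + rnorm(n_days, sd = 0.5),
  nervous = tension + rnorm(n_days, sd = 0.5),
  jittery = tension + rnorm(n_days, sd = 0.5)
)

fit <- factanal(P_data, factors = 2, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)   # which states travel together across days
```

The loadings describe how this one person's states covary across days; they say nothing
about how other individuals are structured.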
Approaching psychological phenomena from a within-person analysis perspective is 
crucial for epistemological reasons. P-technique, as well as other forms of within-person 
factor analysis, rests on the idea that people are dynamic and complex systems whose 
adaptive, regulatory, and other processes proceed at the individual level (Ram & Gerstorf,
2009). Identifying patterns of covariation in multiple observations of the same person
across time provides a representation of that individual’s functioning (Nesselroade,
1991). The focus is on describing, explaining, predicting, and potentially modifying indi- 
vidual behavior, not sample- or population-level behavior. Interindividual differences in 
the individual-level processes can be studied in a subsequent step. 

Moreover, the study of relations among variables at the within-person level has
been called a necessity given that knowledge about how variables relate across individuals 
at a single time point (between-person covariation) cannot be used to make inferences 
about any individual’s actual behavior (Molenaar, 2004; Robinson, 1950). Recently, 
Molenaar (2004) underscored this point using mathematical proofs. Outlining the rel- 
evance of ergodic theorems from mathematical statistics, he demonstrated that within- 
person and between-person structures are equivalent (i.e., ergodic) only under very strict 
(and likely rare) conditions, namely, (1) stationarity of variables’ attributes and (2) equiv- 
alence of the relations among variables for all individuals. As such, it seems imperative 
that researchers make use of within-person analysis frameworks in the formulation and 
testing of psychological theory (see also Hamaker, Chapter 3, this volume). 

Within-person factor analysis has been used in some areas of psychology for decades 
(e.g., behavioral analysis, clinical psychology; for reviews, see Jones & Nesselroade, 
1990; Luborsky & Mintz, 1972; Russell, Jones, & Miller, 2007). The first applications 
of within-person factor analysis, P-technique in particular, date back over half a century 
(Cattell et al., 1947). Dynamic extensions (i.e., extensions that model time-dependent 
aspects of time series) were introduced a quarter-century ago (Molenaar, 1985). Appli- 
cations span many areas of inquiry, including, for example, the investigation of intrain- 
dividual covariation of affective states (Zevon & Tellegen, 1982), of personality factors 
(Hamaker, Nesselroade, & Molenaar, 2007), and of motivation and cognition (Brose, 
Schmiedek, Lövden, Molenaar, & Lindenberger, 2010). In the latter case, for example, 
204 individuals worked on cognitive tasks across 100 occasions and answered questions 
on their motivational state. Research questions spanned inquiry into whether days with 
better performance were days with more motivation, whether two components of moti- 
vation could be differentiated (effort and enjoyment), and whether the assumed positive 
relationship between motivation and performance would decrease with age. Assumptions 
were supported only at an average level. Substantial interindividual differences were pres- 
ent both in the dimensionality underlying covariation and in the strength of the associa- 
tion between motivation and cognition. As this and other studies demonstrate, the tech- 
nological advances in data collection and computation speed are turning within-person 
factor analysis into a mainstream methodology. 


Technical Background 


P-technique factor analysis is procedurally the same as the familiar between-person 
(R-technique) factor analysis to which most researchers are exposed as part of their 
graduate research methods training. What differs are the data to which the models are 
applied. In the usual R-technique factor analysis, the common factor model is applied to 
multivariate observations obtained from multiple subjects at a single measurement 
occasion (a persons x variables matrix of scores). In contrast, in P-technique factor analysis, 
the common factor model is applied to multivariate single-subject time series data (an 
occasions x variables matrix of scores). The model can be written as 

$$y(t) = \Lambda \eta(t) + \varepsilon(t) \qquad (1)$$


where $y(t)$ is a p-variate time series of observations indexed by time (t = 1, 2, ..., T), $\Lambda$ is 
a p x q factor loading matrix, $\eta(t)$ is a q-variate time series of latent factor scores, and $\varepsilon(t)$ 
is a p-variate residual (specific error + measurement error) time series. An example model 
is depicted graphically in Figure 25.1. The path model depicts how a six-variate $y(t)$ time 
series (squares labeled $y_1$ to $y_6$) is “driven” by two common factor score series (circles 
labeled $\eta_1$ and $\eta_2$) that are appropriately weighted by the factor loadings $\lambda_1$ to $\lambda_6$, and 
six residual series (circles labeled $\varepsilon_1$ to $\varepsilon_6$). Summarizing across observations, we obtain 
the covariance expectations of the model 


$$\Sigma = \Lambda \Psi \Lambda' + \Theta \qquad (2)$$


where $\Sigma$, the expected p x p covariance matrix, is a function of $\Psi$, the q x q covariance 
matrix of the common factors $\eta$, pre- and postmultiplied by $\Lambda$, the p x q matrix of factor 
loadings; and $\Theta$ is the p x p covariance matrix of the unique factors $\varepsilon$. Note that in the 
general model some constraints must be imposed to identify the model (e.g., to 
keep the number of degrees of freedom equal to or greater than zero). The model can be 
fit to data using common statistical packages (e.g., Mplus, OpenMx, R, SAS) in either an 
exploratory or confirmatory manner. The key is that the individual-level time series data 
to be analyzed are arranged as an occasions x variables data matrix and summarized as 
a within-person covariance matrix, $\Sigma$. 

FIGURE 25.1. Path diagram of P-technique factor model. 
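
As a minimal sketch of the covariance expectations in Equation 2, the following pure-Python snippet builds the model-implied covariance matrix for a two-factor, six-variable model of the kind shown in Figure 25.1. The loadings, factor correlation, unique variances, and helper names are all hypothetical and chosen only for illustration; they are not estimates from any data set discussed here.

```python
def transpose(matrix):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Multiply two matrices stored as lists of rows."""
    right_t = transpose(right)
    return [[sum(a * b for a, b in zip(row, col)) for col in right_t] for row in left]


def implied_covariance(loadings, factor_cov, unique_var):
    """Sigma = Lambda Psi Lambda' + Theta, with Theta assumed diagonal."""
    sigma = matmul(matmul(loadings, factor_cov), transpose(loadings))
    for i, variance in enumerate(unique_var):
        sigma[i][i] += variance
    return sigma


# Hypothetical parameter values (assumptions, not results from the chapter).
LOADINGS = [
    [0.8, 0.0], [0.7, 0.0], [0.6, 0.0],   # y1-y3 load on factor 1
    [0.0, 0.8], [0.0, 0.7], [0.0, 0.6],   # y4-y6 load on factor 2
]
FACTOR_COV = [[1.0, 0.3], [0.3, 1.0]]      # standardized factors, r = .30
UNIQUE_VAR = [0.36, 0.51, 0.64, 0.36, 0.51, 0.64]

SIGMA = implied_covariance(LOADINGS, FACTOR_COV, UNIQUE_VAR)
```

Fitting the model amounts to choosing the parameter values that bring this implied matrix as close as possible to the observed within-person covariance matrix.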

In the usual R-technique analysis, the between-person common factor model, 
$y(i) = \Lambda \eta(i) + \varepsilon(i)$, is used to model data obtained from many individuals, i = 1 to N, under the 
assumption that the observations are independent replicates; that is, that each observa- 
tion is an independent and exchangeable draw from a population of persons. The rows 
of data can be reordered or shuffled without consequence. In P-technique factor analysis, 
the common factor model, $y(t) = \Lambda \eta(t) + \varepsilon(t)$, is used to model data obtained from one 
individual over many occasions, t = 1 to T. A parallel set of assumptions applies—that 
each of the observations (repeated measures over time) is an independent and exchange- 
able draw from a population of occasions. The rows of data (now repeated measures) can 
be reordered or shuffled without consequence (e.g., there are no time-related associations). 
Depicted graphically in Figure 25.1, there are no sequential dependencies (arrows) between 
the variables (latent and manifest) at occasion t - 1 and those at occasion t. The labels for 
the two occasions, t and t - 1, could be swapped without effect on the model fit or model 
parameters. Given organismic continuity, this is an unlikely circumstance. Rarely would 
we find that repeated measures obtained from the same organism are truly independent 
states or net intraindividual variability (for a distinction of time-structured and net intrain- 
dividual variability, see Ram & Gerstorf, 2009). However, trends and time dependencies 
can be removed and set aside before application of P-technique analysis. Alternatively, as 
we discuss later, the time dependencies can be incorporated into the analysis framework 
using dynamic factor models (e.g., Molenaar, 1985). Decisions about how to separate and 
treat time-structured and net intraindividual variability in the data are at the discretion of 
the researcher, and typically consider both conceptual arguments and study design. 
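
The exchangeability assumption can be made concrete with a small simulation. The sketch below, which uses only the Python standard library and entirely simulated (hypothetical) data, shows that shuffling the rows of a two-variable time series leaves the covariance between the variables unchanged, while the lag-1 autocorrelation that carries the time dependency is largely destroyed. The series length, autoregressive weight, and seed are arbitrary assumptions.

```python
import random


def covariance(x, y):
    """Sample covariance of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)


def lag1_autocorrelation(x):
    """Lag-1 autocorrelation of a single series."""
    n = len(x)
    mean_x = sum(x) / n
    numerator = sum((x[t] - mean_x) * (x[t - 1] - mean_x) for t in range(1, n))
    return numerator / sum((value - mean_x) ** 2 for value in x)


rng = random.Random(2024)

# Simulate a time-dependent series (AR(1), weight .7) and a noisy copy of it.
state, y1 = 0.0, []
for _ in range(200):
    state = 0.7 * state + rng.gauss(0.0, 1.0)
    y1.append(state)
y2 = [value + rng.gauss(0.0, 0.5) for value in y1]

# Shuffle the occasions (rows), keeping each occasion's two scores together.
order = list(range(len(y1)))
rng.shuffle(order)
y1_shuffled = [y1[i] for i in order]
y2_shuffled = [y2[i] for i in order]

COV_ORIGINAL = covariance(y1, y2)
COV_SHUFFLED = covariance(y1_shuffled, y2_shuffled)
R1_ORIGINAL = lag1_autocorrelation(y1)
R1_SHUFFLED = lag1_autocorrelation(y1_shuffled)
```

Because the covariance matrix is all that the P-technique model sees, any information in the ordering of occasions is ignored unless it is modeled or removed beforehand.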


Five Steps for Conducting Within-Person Factor Analysis 


The following sections describe a five-step approach for conducting within-person factor 
analysis. After the essentials of each step are provided, we illustrate implementation 
using data obtained from five pregnant women who provided ratings of their daily mood 
on 100+ consecutive days (Lebo & Nesselroade, 1978). Scripts for the analysis that may 
be adapted to other P-technique research questions are provided at 
www.hhdev.psu.edu/hdfs/faculty/ram.htm. 


Step 1: Research Question 
RESEARCH QUESTIONS 


Every study begins with a research question and corresponding study design. When inter- 
ested in how different variables are related within individuals across time, one may have 
a more or less explicit theory about the underlying structure (e.g., the number of factors, 
the underlying constructs, and the relationships among factors). The aim of the study 
may be to test whether the explicitly stated theory is correct. However, there are also 
investigations in which one does not have an explicit a priori underlying structure to test. 
In these studies, the aim is to describe whether and how the variables are related, and/ 
or how those structures differ across individuals. Conceptually, this is the distinction 
between confirmatory and exploratory factor analysis (CFA and EFA, respectively; see 
Jöreskog & Sörbom, 1979; Thompson, 2004). 


EXPLORATORY P-TECHNIQUE FACTOR ANALYSIS 


Researchers make use of EFA when searching for the simple structures that may describe 
a set of observations. For example, consider time series data collected from individuals on 
how they deal with everyday problems across 100 days, using 16 items that represent four 
broad categories of behavior (planful problem solving, avoidant behavior, distraction, 
and rumination). There is not yet evidence confirming that those 16 behaviors organize 
into four factors at the within-person level. In such instances, it seems useful to explore, 
without explicit hypotheses, how the different problem-solving behaviors covary within 
a person, and whether there are between-person differences in those structures. 


CONFIRMATORY P-TECHNIQUE FACTOR ANALYSIS 


When generating evidence that a specifically postulated structure underlies the data, 
researchers make use of CFA. Theory-guided expectations lead to the formulation of a 
precise model that then is subjected to empirical confirmation. For example, the struc- 
ture of cognitive abilities is hypothesized to revolve around a set of correlated factors 
representing subdomains of cognitive functioning. Factors that have been theoretically 
and empirically distinguished are, for example, working memory, perceptual speed, and 
short-term memory (e.g., Mogle, Lovett, Stawski, & Sliwinski, 2008). However, up to 
now, there is little evidence that these same and other constructs can be used to organize 
the day-to-day variability in an individual’s cognitive task performance. Within-person 
CFA is then needed to confirm (or reject) the hypothesis that day-to-day variability on a 
variety of cognitive performance tasks can be parsimoniously represented by fluctuations 
in a smaller set of ability constructs (Lindenberger, Li, Lövden, & Schmiedek, 2007). 


Step 1: Empirical Illustration 
RESEARCH QUESTION 


As we all know from personal experience (as well as a plethora of empirical research), 
individuals’ positive affect fluctuates over time. Some days are characterized by high lev- 
els of friendliness and affection; other days, by low levels of peppiness and liveliness. The 
questions that within-person factor analysis is suited to answer center on whether these 
day-to-day differences can be described by a parsimonious set of positive affect states 
(e.g., energy, well-being, and social affection). 


EXPLORATORY P-TECHNIQUE FACTOR ANALYSIS 


Using an exploratory approach, we investigated the structure of individuals’ day-to- 
day experiences of positive affect and whether their positive affective experiences are 
described by similar or different structures. We examined three specific content domains: 
energy, well-being, and social affection (as per the distinction given by the Lebo & Nes- 
selroade [1978] original analysis of the data being used here). We were interested to learn 
about the dimensionality of affective experiences (whether positive affective experiences 
are driven by three distinct factors) and about potentially person-specific structure and 
correlations among the different factors. 


CONFIRMATORY P-TECHNIQUE FACTOR ANALYSIS 


Using a confirmatory approach, we tested whether individuals’ data could be described 
by a specific simple structure defined by three intercorrelated factors, specifically, whether 
daily levels of peppiness, activity, and liveliness would indicate an individual’s daily level 
of energy; whether cheerfulness, happiness, and contentedness would indicate daily levels 
of well-being; and whether friendliness, affection, and kindliness would indicate daily 
levels of social affection. Furthermore, we tested whether this simple structure general- 
izes across individuals, that is, the hypothesis that all five women’s data are well described 
by the same structure (between-person measurement invariance). 


Step 2: Study Design and Data Collection 


One requirement for within-person factor analysis is that each individual must be mea- 
sured on multiple variables repeatedly over many occasions (Cattell, 1966a). The resulting 
time series must be (1) of considerable length, (2) collected on a time scale that matches the 
phenomena, and (3) at equally spaced intervals, when using dynamic factor analysis (see 
below). Although there are no clear rules, it has been recommended that factor-analytic 
studies use a ratio of at least five observations to each variable, and not less than 100 obser- 
vations in any analysis (Gorsuch, 1983, p. 332). The phenomenon should vary on the time 
scale on which the observations were obtained (Sliwinski & Mogle, 2008). If one assumes 
day-to-day fluctuations in mood, data should be obtained at least once a day. In addition, if 
different constructs are investigated simultaneously, they should be expected to covary on 
the same time scale. For example, although negative affect and perceived stress are corre- 
lated, one might not find a within-person covariation because affect might fluctuate faster 
(e.g., across days) than perceived stress that might fluctuate across weeks or months. 
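
The length guideline cited above (at least five observations per variable and no fewer than 100 observations) can be written as a one-line check. The function name and default arguments below are hypothetical; the thresholds are the rules of thumb quoted from Gorsuch (1983), not hard requirements.

```python
def meets_length_guideline(n_occasions, n_variables, min_ratio=5, min_occasions=100):
    """Check the rule of thumb: >= 5 occasions per variable and >= 100 occasions."""
    return n_occasions >= min_occasions and n_occasions >= min_ratio * n_variables


# Hypothetical designs: 100 daily reports on 9 items versus on 75 items.
NINE_ITEMS_OK = meets_length_guideline(100, 9)
SEVENTY_FIVE_ITEMS_OK = meets_length_guideline(100, 75)
```

Under this guideline, a 100-day series supports at most 20 variables, which is one practical reason for analyzing a small, theoretically chosen subset of items.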


Step 2: Empirical Illustration 


M. Lebo collected the data we use for our illustration as part of his investigation of moth- 
ers’ affective experiences leading up to and surrounding the birth of their first child (Lebo 
& Nesselroade, 1978). Data were obtained from five pregnant women who rated their 
mood on 100+ successive days using 75 adjectives (0- to 4-point response scale) covering 
a wide swath of constructs prevalent in the literature of the time. The resulting time series 
(1) span more than 100 occasions; (2) were collected once per day, a time scale that allows 
for modeling the structure of daily affective experiences; and (3) were obtained at equally 
spaced intervals, once each evening. 


Step 3: Data Preprocessing and Variable Selection 
DATA PREPROCESSING 


After collection and before the main analysis, the data should be examined for suitability 
for application of the P-technique common factor model. Specifically, trends, cycles, and 
other time-structured aspects of the data should be identified and potentially removed 
to obtain time series that are stationary and do not contain time dependencies that may 
violate the independence of observations assumption. There are no clear guidelines on 
what preprocessing steps are most appropriate, but rather numerous options, including 
linear, quadratic, or other polynomial regressions (e.g., detrending), frequency analysis 
(i.e., spectral analysis), autoregressions (e.g., autoregressive integrated moving average 
[ARIMA] models), or the application of various filters or smoothers (e.g., loess) (see, 
e.g., Shumway & Stoffer, 2000, for a review). The choice of preparations can consider 
both statistical criteria for testing whether the time series to be analyzed is free of time- 
structured elements (i.e., that it is white noise; e.g., Ljung & Box, 1978) and theoretical 
evaluation of what processes should be identified and removed from the data. Modeling 
and removing these elements is, in essence, a procedural setting aside of particular pro- 
cesses (e.g., learning, circadian rhythms) in order to concentrate on underlying structures 
that are independent of those processes. 
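
As one example of the preprocessing options listed above, linear detrending can be sketched in a few lines: regress the series on the occasion index by ordinary least squares and keep the residuals. This is only one of the many possibilities named in the text, and the function name and example series are hypothetical.

```python
def detrend_linear(series):
    """Remove a least-squares linear trend over the occasion index (0, 1, 2, ...)."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(series)]


# Hypothetical daily scores: a steady upward drift plus a day-to-day zigzag.
RAW = [2.0 + 0.05 * t + (0.5 if t % 2 else -0.5) for t in range(60)]
DETRENDED = detrend_linear(RAW)
```

The residual series has mean zero and no linear drift, so what remains is the day-to-day fluctuation around the removed trend.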


VARIABLE SELECTION 


Once individuals’ time series of observations have been collected and prepared, it is 
important to determine whether the data are, in actuality, suitable for application of 
within-person factor analysis. Most important, there must be reliably measurable varia- 
tions in scores on the specific variables to be analyzed (Comrey & Lee, 1992, p. 238). 
Variables with no within-person variance across time cannot be subjected to factor analy- 
sis. Various rules of thumb have been used to identify and remove variables that do not 
have “sufficient” variance for analysis. These include removing variables with (intrain- 
dividual) standard deviations below 0.10, or variables with more than 80% of scores 
being identical (see, e.g., Lebo & Nesselroade, 1978; Zevon & Tellegen, 1982). The issue 
becomes complicated when the items with insufficient variance differ across individuals. 
Three routes can be taken. 


1. Specific items can be excluded from the individual-level analysis, potentially 
resulting in a different set of variables being analyzed for each individual in the sample. 
The advantage of this route is that as much information as possible is maintained in the 
analyses. 


2. Alternatively, variables that have insufficient variability in one or more individu- 
als are all excluded from the analyses. The advantage of this approach is that across- 
person factor patterns are easily compared. 


3. Both specific individuals and specific items can be removed to strike a balance between 
maintaining a common set of items and retaining individuals. 
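
The variance screens and the first two routes can be sketched as follows. The thresholds are the rules of thumb quoted above (standard deviation below 0.10, or more than 80% identical scores); the function names, data layout (a dictionary of item names to score lists per person), and example scores are hypothetical.

```python
from collections import Counter
from statistics import stdev


def has_sufficient_variance(scores, min_sd=0.10, max_identical=0.80):
    """Apply both rules of thumb to one person's scores on one item."""
    most_common_count = Counter(scores).most_common(1)[0][1]
    return stdev(scores) >= min_sd and most_common_count / len(scores) <= max_identical


def usable_items(person_data):
    """Route 1: the items retained for one individual."""
    return {item for item, scores in person_data.items() if has_sufficient_variance(scores)}


def common_items(sample):
    """Route 2: only items with sufficient variance in every individual."""
    return set.intersection(*(usable_items(data) for data in sample.values()))


# Hypothetical ratings on a 0-4 scale for two people and two items.
SAMPLE = {
    "person_a": {"happy": [0, 1, 2, 3, 4] * 4, "calm": [2] * 19 + [3]},
    "person_b": {"happy": [1, 2, 3, 2, 1] * 4, "calm": [0, 1, 2, 1, 0] * 4},
}
ROUTE_1 = {person: usable_items(data) for person, data in SAMPLE.items()}
ROUTE_2 = common_items(SAMPLE)
```

Route 1 keeps a person-specific item set, whereas Route 2 trades information for comparability by keeping only the intersection.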


Step 3: Empirical Illustration 
DATA PREPROCESSING 


Our goal was to prepare data from the five individuals so that they met the independence 
assumption. To this end, we took the following steps. First, each individual’s time series 
was plotted and inspected visually. The item-level trajectories did not show trends across 
time and were treated as stationary. Second, we tested whether there were significant 
autocorrelations using the Ljung-Box test (for up to six lags; implemented using SAS 
PROC ARIMA). When significant lagged relationships were found, we modeled them 
using an autoregressive [AR(1)] process model. The residuals were then again tested for 
significant autocorrelations. If there were still significant autocorrelations, the model 
was expanded to an AR(2), and the residuals were again examined. The procedure was 
repeated with additional AR effects added until it was determined that there were no lon- 
ger significant time dependencies present at the item level for each individual. The final 
set of residuals, now statistically determined to be white noise, were then collected back 
into a multivariate time series that was deemed suitable for P-technique factor analysis. 
The prepared nine-variate time series for Individuals 1 and 5 are plotted in Panels A and 
C of Figure 25.2. 
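
The logic of this test-and-prewhiten loop can be sketched in pure Python. This is not the SAS script used for the illustration; it is a simplified stand-in that computes the Ljung-Box Q statistic for six lags, fits a single AR(1) term by least squares, and recomputes Q on the residuals. The simulated series, seed, and function names are hypothetical. Note also that, when the statistic is applied to residuals of a fitted AR(p) model, the reference degrees of freedom are usually reduced by p; that refinement is omitted here.

```python
import math
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(x)
    mean_x = sum(x) / n
    numerator = sum((x[t] - mean_x) * (x[t - lag] - mean_x) for t in range(lag, n))
    return numerator / sum((value - mean_x) ** 2 for value in x)


def ljung_box_q(x, max_lag=6):
    """Ljung-Box Q statistic summed over lags 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(autocorrelation(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1))


def chi2_sf_even_df(statistic, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def ar1_residuals(x):
    """Residuals after removing a least-squares AR(1) term from the centered series."""
    mean_x = sum(x) / len(x)
    c = [value - mean_x for value in x]
    phi = sum(c[t] * c[t - 1] for t in range(1, len(c))) / sum(v * v for v in c[:-1])
    return [c[t] - phi * c[t - 1] for t in range(1, len(c))]


# Hypothetical item series with a lag-1 dependency (weight .6), 150 occasions.
rng = random.Random(7)
state, series = 0.0, []
for _ in range(150):
    state = 0.6 * state + rng.gauss(0.0, 1.0)
    series.append(state)

Q_RAW = ljung_box_q(series)
P_RAW = chi2_sf_even_df(Q_RAW, 6)
Q_RESIDUAL = ljung_box_q(ar1_residuals(series))
```

If the residual statistic were still significant, the same step would be repeated with a higher-order autoregressive model, as described above.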


VARIABLE SELECTION 


In his preliminary analysis of the data, Lebo noted that the variation in daily affect did 
not manifest on all 75 items for all women. Within each woman’s P-data, he removed 
items where more than 80% of responses were identical. In the original analysis, Lebo 
and Nesselroade (1978) factor-analyzed all items that had sufficient variance (the first 
of the alternatives given earlier), letting the number of items and dimensionality of the 
affective space differ across individuals. The idiosyncrasy of affective experiences is 
represented best using this approach, but comparisons across individuals are often difficult 
and remain rather descriptive (e.g., the number of factors extracted varied from five to 
nine across individuals; Lebo & Nesselroade, 1978, p. 211). 

FIGURE 25.2. Panels A and C: Day-to-day ratings (after data preprocessing) of nine affective 
states (active, peppy, lively, cheerful, happy, contented, friendly, affectionate, and kindly) for 
Individuals 1 and 5, respectively. Panels B and D: Estimated factor scores, three factors for Individual 
1 (energy, well-being, social affection), two factors for Individual 5 (energy, well-being/affection). 

In our illustration, we used the second of the three alternatives listed earlier. We 
analyzed only the subset of items that exhibited sufficient variance in all women. The 
benefit of this approach is that subsequent quantitative comparisons across individuals 
can be approached in a more straightforward manner. To keep our example parsimoni- 
ous, we limited ourselves to nine items: peppy, active, and lively (as indicators of an 
energy factor), cheerful, happy, and contented (as indicators of a well-being factor), and 
friendly, affectionate, and kindly (as indicators of a social affection factor). We note that, 
as is often the case when working with real data, there has been some “offline” iteration 
between the development of the research questions and the variable selection. 


Step 4: Fitting and Evaluating Person-Specific Models 
EXPLORATORY P-TECHNIQUE FACTOR ANALYSIS 


Exploratory P-technique factor analysis can be implemented in most statistical programs, 
and many resources are available for the specifics of implementation (e.g., Comrey & Lee, 
1992; Gorsuch, 1983). Important decision points include choice of index of association, 
determining the number of factors, and rotation and interpretation of the factors (and, in 
some instances, estimation of the factor scores). Choice of index of association is driven 
by the measurement properties of the collected data, including the scale of measurement 
(e.g., nominal, interval, ratio) and the distribution of scores (e.g., deviations from nor- 
mality that may require transformation). P-technique factor-analytic studies have usually 
analyzed the within-person Pearson product-moment correlation matrix of the prepro- 
cessed data. However, matrices made from other measures of association can also be 
used (e.g., covariances, point biserial, tetrachoric correlations). Methods for determining 
the number of factors span statistical methods (e.g., maximum likelihood), mathematical 
methods (e.g., Guttman’s lower bounds, eigenvalues > 1; Guttman, 1954), and subjective 
evaluations (e.g., subjective scree test; Cattell, 1966b). We note that in most P-technique 
applications the number of observations (sample size) is small, so some 
care should be taken to protect against possible overextraction (Browne, 1968). Factor 
solutions are often rotated for ease of interpretation (rotating to the most parsimonious 
position, a simple structure). A key decision is choice of an orthogonal rotation method 
(e.g., varimax), where it is assumed that the factors are uncorrelated, or an oblique rota- 
tion method (e.g., promax), where the factors may be correlated. Choice should depend 
on theoretical conceptions and on preference for particular definitions of simple structure 
that facilitate interpretability. Working through usual step-by-step EFA procedures (see, 
e.g., Comrey & Lee, 1992, p. 5; Gorsuch, 1983, back cover), the person-specific fac- 
tor solution obtained separately for each individual can be independently evaluated and 
interpreted in relation to theory and other empirical evidence. 
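
One of the mathematical criteria mentioned above, counting eigenvalues greater than 1, can be sketched without external libraries. The snippet below obtains the eigenvalues of a symmetric correlation matrix with a basic cyclic Jacobi rotation scheme and counts those above 1. The correlation matrix is hypothetical, and a real analysis would normally rely on a statistical package and weigh this count against the scree plot and other criteria.

```python
import math


def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):          # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


# Hypothetical within-person correlation matrix for four items in two clusters.
R = [
    [1.0, 0.6, 0.2, 0.2],
    [0.6, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.6],
    [0.2, 0.2, 0.6, 1.0],
]
EIGENVALUES = jacobi_eigenvalues(R)
N_FACTORS = sum(1 for value in EIGENVALUES if value > 1.0)
```

For this matrix the eigenvalues can also be derived by hand (2.0, 1.2, 0.4, 0.4), so the criterion suggests two factors.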


CONFIRMATORY P-TECHNIQUE FACTOR ANALYSIS 


Confirmatory P-technique factor analysis can be implemented in structural equation 
modeling programs (e.g., LISREL, Mplus, OpenMx) or some specialized procedures (e.g., 
PROC CALIS in SAS), with many resources available for the specifics of implementation 
(e.g., Brown, 2006; Loehlin, 1998). In brief, a model of covariance expectations (in the 
form of Equation 2, or the graphical analogue in Figure 25.1) is specified in accordance 
with the theoretical expectations (e.g., a specific hypothesis about structure). The model 
is then fitted, using maximum likelihood procedures, to each individual’s data. Model fit 
statistics (e.g., chi-square, root mean square error of approximation [RMSEA], and com- 
parative fit index [CFI]; see Hu & Bentler, 1999) and parameter estimates are obtained, 
evaluated independently for each individual, and interpreted in relation to theory and 
other empirical evidence. For example, the fit criteria can be used to determine whether 
the hypothesized structure provides an adequate description of that individual’s data. 
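
To make one of these fit statistics concrete, the sketch below computes the RMSEA from a model chi-square, its degrees of freedom, and the number of occasions. It assumes the common formulation with N - 1 in the denominator (some programs use N instead), and the input values are hypothetical.

```python
import math


def rmsea(chi_square, df, n_obs):
    """Root mean square error of approximation (N - 1 formulation)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n_obs - 1)))


# Hypothetical fit result: chi-square = 50 on 24 df from 101 occasions.
EXAMPLE_RMSEA = rmsea(50.0, 24, 101)
```

Because the number of occasions enters the denominator, the same chi-square value implies a larger RMSEA in the short series typical of P-technique studies than it would in a large between-person sample.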


Step 4: Empirical Illustration 
EXPLORATORY P-TECHNIQUE FACTOR ANALYSIS 


To explore the structure of positive affective states of the five women being studied, 
we conducted five P-technique EFAs using SAS PROC FACTOR. For each analysis, the 
P-data being analyzed consisted of a 100+ occasions (days) x nine variables matrix of 
scores. Principal factor extraction procedures were used to estimate communalities and 
obtain eigenvectors and associated eigenvalues for each individual. Three criteria were 
used to determine the number of common factors present in each individual’s data: the 
number of eigenvalues > 1, the scree plot, and parallel analysis (Horn, 1965). When the 
three criteria were not fully coincident, we settled on the more parsimonious solution 
(with preference for the parallel analysis indication; Lance, Butts, & Michels, 2006). 
Each person-specific factor solution was rotated to an oblique simple structure using the 
promax method. Results from the five separate P-technique EFAs are provided in Table 
25.1. 

For brevity, we describe in detail only the evaluation and interpretation of the solution 
obtained for Individual 1. All three criteria (number of eigenvalues > 1, the scree plot, 
and parallel analysis) suggested that there were two factors. The first factor, characterized 
by high loadings for cheerful, happy, contented, friendly, affectionate, and kindly, was 
thus named well-being/affection. The second factor, characterized by high loadings for 
active, peppy, and lively, was named energy. Together, these two factors were positively 
and moderately correlated (r = .36) and accounted for 55% of the total variance. The 
P-technique factor analysis of this individual’s data suggests that positive affective expe- 
riences are structured by two underlying constructs. The evaluation and interpretation 
of the solutions proceeded similarly for the other four solutions, with Individual 2’s data 
characterized by a two-factor structure that accounted for 65% of total variance and 
Individual 4’s data by a two-factor structure (69% of total variance). Finally, Individual 
3’s and Individual 5’s data were characterized by a one-factor structure (50% and 62% of 
total variance), suggesting that, for these individuals, positive affective experiences were 
driven by a single common core construct. 


CONFIRMATORY P-TECHNIQUE FACTOR ANALYSIS 


To test whether the positive affective experiences of the five women are well described 
by a theoretically determined three-factor structure (factors: energy, well-being, social 
affection), we conducted five P-technique CFAs using SAS PROC CALIS. Specifically, 
we tested whether a three-factor, simple structure model fit the data of each of the five 
women. The model was set up such that each of the nine variables would load on one 
factor, that the latent factors were correlated, and that the residuals were all independent. 
The model was identified by fixing a single loading for each factor equal to one. Person- 
specific standardized results, along with select fit indices, are provided in Table 25.2. 
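
The degrees of freedom of such a model follow from counting the unique elements of the covariance matrix and subtracting the free parameters. The sketch below assumes exactly the specification just described (each variable loads on one factor, one loading per factor fixed to 1, freely estimated factor variances and covariances, independent residuals); the function name is hypothetical.

```python
def cfa_degrees_of_freedom(n_variables, n_factors):
    """df for a simple-structure CFA with one marker loading fixed per factor."""
    moments = n_variables * (n_variables + 1) // 2        # unique (co)variances
    free_loadings = n_variables - n_factors               # markers are fixed to 1
    factor_variances = n_factors
    factor_covariances = n_factors * (n_factors - 1) // 2
    residual_variances = n_variables
    n_free = free_loadings + factor_variances + factor_covariances + residual_variances
    return moments - n_free


DF_THREE_FACTOR = cfa_degrees_of_freedom(9, 3)
DF_TWO_FACTOR = cfa_degrees_of_freedom(9, 2)
```

With nine variables there are 45 unique variances and covariances; the three-factor specification uses 21 free parameters and a two-factor variant uses 19, so the two models differ by 2 degrees of freedom.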

The fit of the model to Individual 1’s data was adequate (RMSEA < .10; standardized 
root mean square residual [SRMR] < .10). It seems that the three-factor structure pro- 
vides a reasonable representation of this individual’s day-to-day positive affective experi- 
ences. A plot of the resulting three-variate factor series is shown in Panel B of Figure 25.2. 
As can be seen, scores on the energy, well-being, and affection factors somewhat travel 
together (positive correlations among factors), and there is substantial parsimony in the 
factor score representation of this individual’s affective experiences compared to the raw 
score representation in Panel A. 

The evaluation and interpretation of the other person-specific solutions proceeded 
similarly. However, a caution flag appeared in two cases. Looking at the model parameters 
obtained when fitting to data of Individuals 4 and 5, it may be noted that the correlation 
between the well-being and social affection factors exceeded 1.00, a clear indication that 
the two factors are not separable—overfitting. Such results suggested that we consider an 
alternative, namely, that daily positive affect is driven by only two factors. An alternative 
model was developed, and a second set of CFAs was fit to the data. Specifically, we tested 
the fit of a two-factor model with the items active, lively, and peppy loading on an energy 
factor and the items cheerful, happy, contented, friendly, affectionate, and kindly loading 
on a well-being/affection factor. Person-specific standardized results, along with select fit 
indices, are provided in Table 25.3. 

With the two-factor model being nested under the three-factor model, we were able 
to test, using a chi-square difference test, whether the three-factor model provided a bet- 
ter statistical fit to the data than the more parsimonious two-factor model. This test was 
done separately for each individual. For Individual 1, the three-factor model provided a 
better fit to the data ($\Delta\chi^2$ = 21.8, df = 2, p < .05); for Individual 2, the two-factor model 
($\Delta\chi^2$ = 4.3, df = 2, p > .05); for Individual 3, the two-factor model ($\Delta\chi^2$ = 4.0, df = 2, p > .05); 
for Individual 4, the two-factor model ($\Delta\chi^2$ = 2.5, df = 2, p > .05); and for Individual 5, the 
two-factor model ($\Delta\chi^2$ = 2.3, df = 2, p > .05). A plot of the resulting two-variate factor series 
for Individual 5 is shown in Panel D of Figure 25.2. In summary, a confirmatory test was 
used to confirm whether the theoretically derived models fit each individual’s data, and a 
difference test was used to determine whether the two- or three-factor models provided 
the better representation of day-to-day positive affective experiences. 
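
The difference tests can be checked by hand, because for 2 degrees of freedom the upper-tail chi-square probability has the closed form exp(-x/2). The sketch below converts the five reported difference statistics into p-values and applies a .05 criterion; the variable and function names are hypothetical.

```python
import math


def chi2_sf_df2(statistic):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-statistic / 2.0)


# Chi-square differences (df = 2) reported in the text for Individuals 1-5.
REPORTED_DIFFERENCES = {1: 21.8, 2: 4.3, 3: 4.0, 4: 2.5, 5: 2.3}

P_VALUES = {person: chi2_sf_df2(value) for person, value in REPORTED_DIFFERENCES.items()}
PREFERRED = {
    person: ("three-factor" if p < 0.05 else "two-factor")
    for person, p in P_VALUES.items()
}
```

Only a difference larger than the critical value of about 5.99 is significant at the .05 level, which is why the three-factor model is retained for Individual 1 alone.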


Step 5: Between-Person Differences 


Within-person factor analysis is a person-specific method, in that it is a technique for 
investigating psychological phenomena at the level of the individual. Models are fitted 
to individuals’ data, and the solutions of those models are evaluated one individual at 
a time. Nevertheless, proponents of within-person factor analysis have emphasized that 
the aim of psychological science is to derive nomothetic laws of behavior (Nesselroade, 
2007; Nesselroade & Molenaar, 1999). The individual solutions must at some point be 
integrated for purposes of generalization. For example, as per the other studies of 
interindividual differences, one may seek to classify individuals into groups. 

TABLE 25.3. Results from Five Person-Specific Two-Factor Confirmatory P-Technique 
Factor Analyses 


                                  Individual 1   Individual 2   Individual 3   Individual 4   Individual 5 

Model fit 
  $\chi^2$ (df = 26)                  63.3           44.2           63.3           53.3           54.4 
  RMSEA                               0.11           0.08           0.11           0.10           0.10 
  SRMR                                0.07           0.05           0.04           0.05           0.04 
Std. loadings, $\Lambda$ 
  Energy (E) 
    Active                            0.88           0.85           0.80           0.91           0.87 
    Peppy                             0.57           0.89           0.75           0.91           0.81 
    Lively                            0.75           0.98           0.86           0.93           0.81 
  Well-being/affection (WB/A) 
    Cheerful                          0.76           0.84           0.92           0.82           0.93 
    Happy                             0.75           0.82           0.90           0.79           0.92 
    Contented                         0.52           0.70           0.86           0.69           0.89 
    Friendly                          0.83           0.78           0.81           0.88           0.91 
    Affectionate                      0.84           0.48           0.76           0.74           0.73 
    Kindly                            0.63           0.81           0.77           0.72           0.82 
Factor correlation, $\Psi$            .35            .59            .72            .65            .65 


Note. df, degrees of freedom; RMSEA, root mean square error of approximation; SRMR, standardized root mean square 
residual; E, energy; WB/A, well-being/affection. 


While the individual is maintained as the unit of analysis in within-person factor 
analysis, between-person comparisons can be made after the person-specific solutions 
have been obtained. Several approaches have been used to identify the similarities and 
differences among the individual-level structures obtained via within-person factor anal- 
ysis. EFA solutions from multiple individuals can be examined with respect to specific 
characteristics of the model and its parameters, including, for example, the number of 
factors and the pattern of factor loadings (e.g., Hamaker et al., 2007). Given the gen- 
erally small sample sizes used in P-technique studies, identification of similarities and 
differences in structure can usually be summarized through qualitative descriptions of 
individual-level results. As the sample size increases, these descriptions can be quanti- 
fied as follows: Similarities and differences among patterns of factor loadings obtained 
from multiple samples (i.e., individuals) can be quantified using congruence coefficients 
or other measures of pattern similarity. For example, in their classic P-technique study of 
the structure of affect, Zevon and Tellegen (1982) assessed the similarity among many 
individuals’ structures using Tucker’s congruence coefficient. Using similar principles, 
Nesselroade and Molenaar (1999) assessed similarity among individuals’ raw covariance 
structures using a version of Bartlett’s test of variance homogeneity. Once interindividual 
differences in within-person characteristics have been identified (e.g., in the number of 
factors), such quantitative differences can be examined in relation to other interindividual 
differences (e.g., age; for an application, see Carstensen, Mayr, Pasupathi, & Nessel- 
roade, 2000). 

CFA solutions from multiple individuals can also be integrated through both qualita- 
tive descriptions of the individual results and formal statistical tests. Modern computing 
provides the possibility to assess the fit of several a priori models to many individuals’ 
P-data quickly and efficiently. For example, individuals can be categorized into homogeneous
groups based on the relative fit of multiple factor models (e.g., grouping individuals
based on whether their data were best fit by a one-, two-, or three-factor model; see Ram, 
Carstensen, & Nesselroade, 2011), or rank-ordered according to a particular parameter 
in the model (e.g., factor correlation) or according to fit of a specific model (e.g., absolute 
misfit of a specific two-factor model; Ong, Horn, & Walsh, 2007). Furthermore, making 
use of multiple-group equality constraints, it is possible to test formally whether two or 
more individuals’ data can be described by the same factor model parameters. The logic 
of such pairwise tests exactly follows the logic underlying tests for measurement invari- 
ance across multiple groups or occasions (Meredith, 1993; see also Eid, Courvoisier, 
& Lischetzke, Chapter 21, this volume). Specifically, observations from each individual 
are conceptualized as separate groups, with confirmatory models being fit to the multigroup
data with and without equality constraints. Nested model comparisons provide
evidence that the individual models can be considered equivalent or different, so that
like individuals can be described by the same model and separated from unlike individuals.
We underscore that, as per the bottom-up strategy, identification, description, and testing 
of between-person similarities and differences should be completed only after individual- 
level models have been obtained and examined. 


Step 5: Empirical Illustration 


Looking across individuals’ exploratory solutions presented earlier, three individuals’
day-to-day positive affect experiences were described by a two-factor structure, energy
and well-being/affection. For Individuals 2 and 4, the factor intercorrelations
were very similar (r = .55 and .57, respectively), while for Individual 1, the factors were a
bit more distinct (r = .36). Thus, we see that the degree to which energy and well-being/ 
affection travel together differed across individuals. Although the factor loadings for the 
indicator items were generally high for all individuals, the main indicators of each fac- 
tor differed. Thus, the underlying constructs are “colored” slightly differently for each 
individual. The other two individuals’ data, Individuals 3 and 5, were characterized by 
one-factor structures. For these individuals, all the positive affect states examined seem 
to occur in tandem. It may be further noted that for both these individuals, the factor is 
most prominently indicated by items typically considered as indicators of well-being (e.g., 
cheerful, happy, contented). 

To quantify the between-person similarities across the exploratory P-technique solu- 
tions we compared the loading matrices of individuals with the same number of factors 
(e.g., the two-factor or one-factor solutions) using Tucker’s (1951) coefficient of congru- 
ence. We found high levels of congruence throughout (i.e., c > .85; Lorenzo-Seva & 
ten Berge, 2006). Across Individuals 1, 2, and 4, the congruence between the loading 
patterns for the energy factor all ranged between .95 and .98, and between the loading 
patterns for the well-being/affection factor, between .97 and .99. The loading patterns
for the one-factor solutions (Individuals 3 and 5) were found to be perfectly congruent 
(1.0) using Tucker’s index. In summary, by both subjective and objective evaluations, the 
solutions were very similar among the three “two-factor individuals” and among the two 
“one-factor individuals.” 
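
The congruence comparison described above can be sketched in a few lines. The coefficient is the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. The loading values below are hypothetical and chosen only for illustration; they are not the exploratory loadings from this chapter, and the function name is an assumption.

```python
import math

def tucker_congruence(x, y):
    """Tucker's coefficient of congruence between two loading vectors."""
    if len(x) != len(y):
        raise ValueError("loading vectors must have the same length")
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

# Hypothetical loadings of the same nine items on one factor for two people
# (illustrative values only, not the chapter's EFA results).
person_a = [0.80, 0.75, 0.85, 0.30, 0.25, 0.20, 0.15, 0.10, 0.20]
person_b = [0.85, 0.70, 0.90, 0.20, 0.30, 0.25, 0.10, 0.15, 0.25]

phi = tucker_congruence(person_a, person_b)
print(round(phi, 3))
```

Because the index is insensitive to proportional rescaling, two loading patterns that differ only by a constant multiplier yield a value of 1.0, so a high coefficient speaks to similarity of pattern rather than equality of loading magnitudes.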

Looking across the five person-specific confirmatory solutions, we found that four 
individuals’ data were better described by the two-factor energy and well-being/affection 
model than by the more complex three-factor energy, well-being, and affection model.
For all individuals the two factors were positively and relatively highly correlated, and for
the most part the items loaded comparably on their respective factors (except affection- 
ate for Individual 2). For Individual 1, though, the three-factor model provided the better 
representation of the data, although still with a relatively high correlation between the 
well-being and affection factors. 

We examined the similarities among the two-factor solutions using pairwise statisti- 
cal tests. Specifically, treating each individual as a separate group, we tested whether the 
parameters for a pair of individuals were metrically invariant (i.e., that the factor loadings 
were the same). Six pairwise comparisons (two individuals each) were tested. The chi- 
square difference tests (7 df) ranged from 8.5 (Individuals 2 and 3) to 28.4 (Individuals 3 
and 4), with the pairwise tests among Individuals 2, 3, and 5 being nonsignificant (critical 
χ² with α of .05 = 14.1). Following up with a three-group test, we found that Individuals
2, 3, and 5 could all be described by the same model (metric invariance test: Δχ² = 18.6, df
= 14, p > .05). In summary, the structure of day-to-day positive affective experiences was 
the same in three of the five women in this study. 
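
The nested-model comparisons reported above rest on referring a chi-square difference to a chi-square distribution with degrees of freedom equal to the number of constraints. As a minimal sketch, the upper-tail probability can be computed with the standard library alone for integer degrees of freedom; the function name is an assumption, and only the three chi-square differences stated in the text are used as inputs.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Closed-form starting point: df = 2 for even df, df = 1 for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Chi-square differences reported in the text.
comparisons = [
    ("Individuals 2 and 3 (pairwise)", 8.5, 7),
    ("Individuals 3 and 4 (pairwise)", 28.4, 7),
    ("Individuals 2, 3, and 5 (three-group)", 18.6, 14),
]
for label, delta, df in comparisons:
    p = chi2_sf(delta, df)
    verdict = "invariance retained" if p > 0.05 else "invariance rejected"
    print(f"{label}: delta chi2 = {delta}, df = {df}, p = {p:.3f}, {verdict}")
```

A nonsignificant difference means that constraining the loadings to equality does not worsen fit appreciably, which is the sense in which two or more individuals are said to share the same model.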


Extensions: 
Dynamics, Nonstationarity, and Idiographic Filters 


Within-person factor analysis models have been extended in various ways. Addressing 
some of the early critiques that P-technique factor analysis ignores the time dependen- 
cies in the data (e.g., Anderson, 1963), Molenaar (1985) introduced the dynamic factor 
analysis (DFA) model. Instead of preprocessing the data to remove time dependencies, as 
we did in the example, lagged effects are incorporated directly into the model and mod- 
eled explicitly. For example, Equation 1 can be supplemented with a model for auto- and 
cross-regressions among the latent states: 


$\eta(t) = B_1\,\eta(t-1) + B_2\,\eta(t-2) + \cdots + B_s\,\eta(t-s) + \zeta(t)$   (3)


where the q-variate latent state series $\eta(t)$ is a function of k = 1, 2, ..., s prior latent
states, $\eta(t-1)$ to $\eta(t-s)$, which are weighted by $B_1$ to $B_s$. Present time “disturbances” are
then introduced as a q-variate set of latent “innovations,” $\zeta(t)$ (for further details, see,
e.g., Nesselroade, McArdle, Aggen, & Meyers, 2002). The DFA extends the P-technique 
model by providing an explicit model for a process in which present states affect or lead 
to future states. Often, such models may be more theoretically appealing than looking 
at factor structures in which the time-structured elements of the data have been stripped 
away and set aside in order to obtain net intraindividual variability and meet model 
assumptions. It has been noted, however, that in some cases P-technique common factor 
models obtain approximately the same structure and factor scores (with time dependen- 
cies intact) when applied to data that contain one-lag time-dependent processes (Mole- 
naar & Nesselroade, 2009). Given the theoretical appeal of process-oriented hypotheses, 
DFA models should be considered and tested as viable alternatives to the more restrictive 
P-technique analyses reviewed here (see Ram, Brose, & Molenaar, in press, for step-by- 
step illustration with the same data example). 
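
To make the process in Equation 3 concrete, the following sketch simulates a two-variate latent state series with a single lag (s = 1). It only generates data from the latent part of the model; it does not fit a DFA. The weight matrix, the standard-normal innovations, and the function name are all hypothetical choices made for illustration.

```python
import random

def simulate_latent_var1(B, n_occasions, seed=0):
    """Simulate eta(t) = B * eta(t-1) + zeta(t) for a q-variate latent series."""
    rng = random.Random(seed)
    q = len(B)
    eta = [0.0] * q                      # starting state (assumed zero)
    series = []
    for _ in range(n_occasions):
        zeta = [rng.gauss(0.0, 1.0) for _ in range(q)]   # latent innovations
        eta = [sum(B[r][c] * eta[c] for c in range(q)) + zeta[r] for r in range(q)]
        series.append(eta)
    return series

# Hypothetical lag-1 weights: diagonal entries are autoregressions,
# off-diagonal entries are cross-regressions between the two latent states.
B1 = [[0.5, 0.1],
      [0.2, 0.3]]

states = simulate_latent_var1(B1, n_occasions=100)
print(len(states), len(states[0]))
```

Setting the off-diagonal entries of the weight matrix to zero would describe two latent states that each carry over from one occasion to the next but do not lead to one another.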

Psychological phenomena vary on different time scales, and in addition to short- 
term, reversible variability, individuals’ characteristics change across time. Recently, 
DFA models have been extended to accommodate and model nonstationary time series 
(Molenaar & Ram, 2009; Molenaar, Sinclair, Rovine, Ram, & Corneal, 2009). In the
P-technique and DFA models, the parameters are assumed to be constant (i.e., stationary) 
in time; that is, the structure or process does not change across the observation period. 
In the nonstationary models, the parameters become time-varying, so that transient or
trend-type changes in the parameters that describe the structure or process (e.g., factor
loadings, autoregressions) can be modeled explicitly.

Extending how between-person differences in structure or dynamics are approached, 
Nesselroade and colleagues recently proposed that invariance tests should concentrate on 
similarity of factor correlations or auto- and cross-regressions (see Nesselroade, Gerstorf, 
Hardy, & Ram, 2007, and accompanying commentaries/critiques). The proposal is that 
the latent processes or structures may be highly similar or even equivalent across indi- 
viduals, even though the indicator variables may be different. 


Synopsis 


Our purpose in this chapter was to introduce within-person factor analysis to research- 
ers interested in within-person phenomena. Such phenomena can now be studied with 
increasing ease using modern data collection technologies, such as those being used in 
experience sampling designs that collect information on stable and dynamic aspects of 
psychological functioning in ecologically valid environments. Within-person factor anal- 
ysis is a bottom-up, person-specific approach to hypothesis testing and data analysis. The 
individual is the unit of analysis. Relations among variables are investigated one individ- 
ual at a time. Applied to within-person P-data, factor analysis can be used to identify and 
describe how a set of variables travels together across time and reveal parsimonious struc- 
tures that may underlie occasion-to-occasion differences or changes in an individual’s 
behavior. Once those structures are established at the individual level, they can be exam- 
ined in relation to one another, and the between-person differences and similarities can 
be described, quantified, and examined. We hope, through our step-by-step illustration, 
that we have provided a guide for when and how within-person factor analyses may be 
incorporated into empirical research programs, taking us further along the route toward 
describing, predicting, explaining, and potentially modifying individuals’ behavior. 
CHAPTER 26


Multilevel Mediational Analysis 
in the Study of Daily Lives 


NOEL A. CARD 


Daily life proceeds as a chain of probabilistically causal events. In the social sciences,
we frequently hypothesize and evaluate links in which a presumed thought, behav- 
ior, or event (X) leads to an increased or decreased likelihood of an outcome (Y). Such a 
model is shown at the top of Figure 26.1. Although daily life is infinitely more complex, in 
that an event can have different outcomes for different people or contexts (multifinality) 
and an outcome can have multiple causes that also vary across people or contexts (equi- 
finality), this general model of one variable predicting another in a probabilistic manner 
is useful in trying to understand daily life. 

Another complication arising from this simplified model is that events often do not 
directly cause the outcome, but instead set in motion a sequence of events that (probabilistically)
lead to the outcome. Such a model is shown at the bottom of Figure 26.1.
These intervening events are called mediators, and a model of this process is considered 
a mediational model. Conceptualization and evaluation of such mediational models are 
of critical importance to both basic and applied science. Toward the former, these models 
provide a more complete understanding of the sequential processes occurring in daily life, 
and toward the latter, they provide options for modification of the outcome that may be 
more malleable than more distal causes. 

In this chapter, I describe methods of statistical evaluation of mediational models, 
specifically within a multilevel context. In the next section, I describe methods of evalu- 
ating mediation in the traditional, single-level context. This provides the basis for subse- 
quent sections dealing with multilevel mediation. Here, I first describe the evaluation of 
multilevel mediation models using multilevel modeling techniques. I then describe evalu- 
ation of multilevel mediation models using multilevel structural equation modeling, and 
the advantages of this approach. I end with consideration of the challenges and future 
directions in the evaluation of multilevel mediation. My goal throughout the chapter is to 
provide an accessible introduction to these techniques, and to refer interested readers to 
more technically thorough resources throughout. 
FIGURE 26.1. Conceptual representation of predictive and mediational models: (a) Direct predictive
model; (b) Mediation model.


Evaluating Mediation in Single-Level Contexts 


Consider again Figure 26.1, this time noting the alphabetic labels next to the paths (the 
notation in this figure and equations in this section follows that of MacKinnon, 2008). In 
the top portion (Figure 26.1a), the total strength of association of the independent vari- 
able (X) predicting the dependent variable (Y) is denoted as c. This overall association is 
decomposed into two components in the lower portion (Figure 26.1b). One component 
is the indirect effect (a.k.a. the mediated effect) through the mediator (M), which is equal
to the product of the a and b paths (i.e., the indirect, or mediated, path = ab). The other 
component is the direct effect of X predicting Y outside of the impact on the mediator, 
denoted as c’, which could include either direct impact of X to Y or other mediational 
pathways not included in the model. 

To evaluate a mediational hypothesis using (single-level, manifest variable) regres- 
sion, one estimates two equations (note that intercepts are estimated but not shown): 


$M = aX + e_1$   (1)
$Y = c'X + bM + e_2$   (2)


From these equations, two parameter estimates are of interest: the regression coefficient
(a) of X predicting M from Equation 1, and the regression coefficient (b) of M
predicting Y, controlling for X, from Equation 2. The regression coefficient (c’) of X
predicting Y, controlling for M, from Equation 2 is sometimes considered if one wishes to 
describe the direct effect, but this is not of central interest in evaluating mediation.
Readers familiar with the procedures described by Baron and Kenny (1986) may question why
a third equation ($Y = cX + e_3$) is not estimated. Simply put, although the Baron and Kenny
approach to evaluation based on the results of multiple analyses has been widely used in 
psychological research over the past 25 years, recent simulation work has shown it to be 
inferior (specifically, lower statistical power and failure to quantify the indirect effect) to 
modern methods of evaluating mediation (MacKinnon, Lockwood, Hoffman, West, & 
Sheets, 2002). I recommend against using the Baron and Kenny approach. 
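
The decomposition of the total effect into indirect and direct components can be checked by substituting Equation 1 into Equation 2. The following is a brief derivation for the single-level linear case, with intercepts omitted and the two residuals written as $e_1$ and $e_2$ (labels assumed here for illustration):

```latex
\begin{aligned}
Y &= c'X + bM + e_2 \\
  &= c'X + b\,(aX + e_1) + e_2 \\
  &= (c' + ab)\,X + (b\,e_1 + e_2)
\end{aligned}
\qquad\Longrightarrow\qquad c = c' + ab
```

In this linear, single-level setting the total effect c is therefore recoverable from the two estimated equations, which is consistent with focusing directly on the product ab rather than on a separate regression of Y on X alone.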

When evaluating a mediational model, our main focus is on the ab product, which 
represents the indirect effect of interest. After fitting the two previous regression equa- 
tions (Equations 1 and 2), the indirect effect is computed by hand by multiplying the two 
regression coefficients, a and b. This value is directly reported in a research report and 
represents the increase in the dependent variable per unit increase in the independent 
variable that operates through the mediator (the remainder of the total effect of X to Y 
is through the direct effect). The unstandardized indirect effect (i.e., unstandardized a ×
unstandardized b) should be reported, as statistical inferences (described next) are made
using this parameter; however, it may also be useful to report the standardized indirect 
effect, which is simply the product of the two standardized regression coefficients (a 
and b). 

We typically, however, also wish to make inferential conclusions about this pathway, 
either through significance testing or computation of confidence intervals. There are two 
approaches I recommend. The first of these is to compute a standard error of the indirect 
(ab) effect using the following formula derived by Sobel (1982): 


$se_{ab} = \sqrt{\hat{a}^2\, se_{\hat{b}}^2 + \hat{b}^2\, se_{\hat{a}}^2}$   (3)


This estimation of the standard error of the indirect effect is easily computed given
the regression coefficients and their standard errors from the output of the program used
to fit the previous regression equations. Specifically, $\hat{a}$ and $\hat{b}$ are the values of the two
regression coefficients of Equations 1 and 2, respectively, and $se_{\hat{a}}^2$ and $se_{\hat{b}}^2$ are the squared
standard errors of these regression coefficients. This standard error can then be used for
statistical hypothesis testing (evaluating the ratio of ab to its standard error according
to a normal distribution, $Z = ab/se_{ab}$) or constructing confidence intervals (adding and
subtracting the product of the critical Z [e.g., 1.96 for 95% confidence intervals] and this 
standard error to the value of ab). Thus, the major advantage of this approach is that it is 
relatively simple to implement. 
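
The computation just described can be sketched as follows. The coefficient values and standard errors passed to the function are hypothetical inputs chosen for illustration, and the function name is an assumption rather than part of any statistical package.

```python
import math

def sobel_test(a, se_a, b, se_b, z_crit=1.96):
    """Indirect effect, Sobel standard error, Z ratio, and symmetric confidence interval."""
    ab = a * b
    se_ab = math.sqrt(a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2)
    z = ab / se_ab
    ci = (ab - z_crit * se_ab, ab + z_crit * se_ab)
    return ab, se_ab, z, ci

# Hypothetical estimates from Equations 1 and 2 (not results from a real study).
ab, se_ab, z, ci = sobel_test(a=0.50, se_a=0.10, b=0.40, se_b=0.10)
print(f"ab = {ab:.3f}, se = {se_ab:.4f}, Z = {z:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval produced this way is symmetric around ab by construction, which is the property that the normality assumption imposes.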

The disadvantage is that this approach assumes that the indirect effect is normally 
distributed. However, this assumption is likely untrue, and the actual distribution is often 
asymmetric and kurtotic. For this reason, a better practice is to construct asymmetric 
confidence intervals using bootstrapping techniques, in which no assumptions are made 
about the distribution of the indirect effect. Instead, the computer samples (with replace- 
ment, so some cases appear multiple times, whereas others are excluded) cases from the 
dataset, computes the indirect effect from this drawn sample, then repeats this process 
many times (e.g., 1,000 or more). After many indirect effects are obtained from these 
samples, the distribution of these indirect effects is considered. Statistical inference is
made by considering the range of this distribution; for example, whether the middle 95% of
the bootstrapped estimates cover a range that does not include 0. Simulation studies have shown that a
particular type of bootstrap, the bias-corrected bootstrap, performs well for both significance
testing and constructing confidence intervals of the indirect effect (MacKinnon,
Lockwood, & Williams, 2004). 
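
The resampling procedure can be sketched with the standard library. Note that this sketch implements the simple percentile bootstrap (the middle 95% of resampled estimates), not the bias-corrected variant recommended above, which adds an adjustment step. The simulated data, the population values used to generate them, and the function names are all assumptions made for illustration.

```python
import random

def indirect_effect(rows):
    """Return the product a*b for rows of (x, m, y), using OLS for Equations 1 and 2."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_m = sum(r[1] for r in rows) / n
    mean_y = sum(r[2] for r in rows) / n
    sxx = sum((r[0] - mean_x) ** 2 for r in rows)
    smm = sum((r[1] - mean_m) ** 2 for r in rows)
    sxm = sum((r[0] - mean_x) * (r[1] - mean_m) for r in rows)
    sxy = sum((r[0] - mean_x) * (r[2] - mean_y) for r in rows)
    smy = sum((r[1] - mean_m) * (r[2] - mean_y) for r in rows)
    a = sxm / sxx                                          # X predicting M
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm ** 2)   # M predicting Y, controlling for X
    return a * b

def percentile_bootstrap(rows, n_boot=1000, level=0.95, seed=0):
    """Percentile confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(rows)
    estimates = sorted(
        indirect_effect([rows[rng.randrange(n)] for _ in range(n)])  # resample cases with replacement
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[round(tail * n_boot)]
    upper = estimates[round((1.0 - tail) * n_boot) - 1]
    return lower, upper

# Simulated data with hypothetical population values a = 0.5, b = 0.4, c' = 0.2.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.gauss(0, 1)
    m = 0.5 * x + rng.gauss(0, 1)
    y = 0.4 * m + 0.2 * x + rng.gauss(0, 1)
    data.append((x, m, y))

est = indirect_effect(data)
lo, hi = percentile_bootstrap(data)
print(f"ab = {est:.3f}, 95% percentile interval = ({lo:.3f}, {hi:.3f})")
```

Unlike the Sobel interval, the limits obtained here need not be equidistant from the point estimate, because they follow the shape of the resampled distribution.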


Multilevel Mediation Models 


Social science data are often nested, or contain variability at multiple levels of organiza- 
tion. Prototypical examples are of students within classrooms or workers in companies. 
Here the nesting is of individuals within groups, and a variable might vary both within
groups (i.e., at the individual level) and between groups; we refer to the individuals
as level-1 units and the groups as level-2 units. As described by Nezlek (Chapter 20, this
volume), the study of daily lives also frequently contains nesting of time periods (e.g., 
days) within individuals. Here, we could consider time as the level-1 unit and individual 
as the level-2 unit. Of course, there could be higher orders of nesting such as time (level 1) 
nested within individuals (level 2) nested within groups (level 3), but I focus only on the 
two-level situation in this chapter for simplicity. 

Just as this nesting necessitates special multilevel analyses to answer other research 
questions, this nesting requires a multilevel approach to evaluating mediation. However, 
additional complexity arises when we consider that mediational models necessarily con- 
tain at least three variables: the independent, or predictor, variable (X); the mediating, 
or intervening, variable (M); and the dependent, or outcome, variable (Y). Because each 
of these three components of a mediational model could vary within and/or between 
level-2 units, there exists a family of seven multilevel mediational models. These models, 
summarized in Table 26.1, are described next. In describing these models, I use the ter- 
minology introduced by Krull and MacKinnon (2001), as used by Preacher, Zyphur, and 
Zhang (2010). 

In the context of studying daily lives, a variable measured repeatedly across time 
(i.e., time-varying) has the potential to vary within, as well as between, individuals, and 
is considered a level-1 variable. Examples of level-1, or time-varying, variables include 
fluctuating experiences, mood, or physiological functioning. In contrast, a variable that 
is measured as a characteristic of the individual, presumably because it is assumed to be 
stable at least over the course of the study (i.e., time-invariant), is considered a level-2 
variable. Examples include variables that are typically considered traits, such as gender,
ethnicity, and personality. It is worth emphasizing that describing X, M, or Y as level-1 or
-2 variables is a design issue (although design should typically follow conceptual expectations)
dictated by whether a variable is measured daily and found to vary across days
(level 1, time-varying) versus measured once and assumed not to vary across the study
(level 2, time-invariant; an exception to this classification is that a variable might be measured
daily but not vary within individuals, so it would therefore be treated as a level-2
variable). To illustrate the concept that the level of a variable depends on design, imagine
that a developmental scientist wishes to measure depressive symptoms among children.
If the scientist simply takes a single measure of depression, the assumption is that depression
is time-invariant (level 2). In contrast, if the scientist measures depression every day,
and there is empirical evidence of within-person variation, then depression is considered
time-varying (level 1). Thus, the same construct can be considered either time-varying or
-invariant depending on the frequency of measurement.


TABLE 26.1. Possible Combinations of Time-Varying (Level-1) and Time-Invariant
(Level-2) Variables in Mediational Models

Model    X                 M                 Y

1-1-1    Time varying      Time varying      Time varying
1-1-2    Time varying      Time varying      Time invariant
1-2-1    Time varying      Time invariant    Time varying
1-2-2    Time varying      Time invariant    Time invariant
2-1-1    Time invariant    Time varying      Time varying
2-1-2    Time invariant    Time varying      Time invariant
2-2-1    Time invariant    Time invariant    Time varying
2-2-2    Time invariant    Time invariant    Time invariant

As summarized in Table 26.1, the different possible multilevel mediation models 
are defined by whether X, M, and Y are time-varying (level 1) or time-invariant (level 2) 
variables. Given that each of X, M, and Y can be of two types (level 1 or level 2), there 
are eight possible combinations of mediational models; however, only seven of these are 
multilevel, in that a 2-2-2 combination is simply a single-level mediational model, and 
would be analyzed using the traditional approach described earlier. Against this back- 
drop of seven potential combinations of level-1 and level-2 variables in the simple three- 
variable (X, M, and Y) mediational model, I next describe two approaches to analyzing 
mediation. The first is the traditional multilevel modeling (MLM) approach, and the 
second is the newer, multilevel structural equation modeling (MSEM) approach. As I 
describe later, the traditional MLM approach is limited in terms of both the types of
situations it can accommodate and the potential biases of its estimates. In contrast,
the MSEM approach can be used in all seven situations, and the estimates of mediation 
are unbiased. 
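
The counting argument above can be reproduced mechanically. The sketch below enumerates the eight level combinations for X, M, and Y, sets aside the single-level 2-2-2 case, and flags the designs in which both M and Y are level-1 variables, which is the condition the traditional MLM approach requires for its two outcome variables. Variable names are illustrative.

```python
from itertools import product

LEVEL_LABEL = {1: "Time varying", 2: "Time invariant"}

designs = list(product((1, 2), repeat=3))             # levels of (X, M, Y)
multilevel = [d for d in designs if d != (2, 2, 2)]   # 2-2-2 is a single-level model
mlm_ok = [d for d in multilevel if d[1] == 1 and d[2] == 1]  # M and Y both at level 1

for x, m, y in designs:
    print(f"{x}-{m}-{y}", LEVEL_LABEL[x], LEVEL_LABEL[m], LEVEL_LABEL[y], sep="   ")

print("Multilevel designs:", len(multilevel))
print("Designs with M and Y at level 1:", ["-".join(map(str, d)) for d in mlm_ok])
```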


Multilevel Modeling Approach 


Because the MLM approach is described elsewhere in this book (Nezlek, Chapter 20, this 
volume), I do not attempt to explain these models in detail. However, it is worth remind- 
ing readers of the basic multilevel equations: 


Level 1: $y_{ij} = \beta_{0j} + \beta_{1j}(\text{Time-Varying}) + r_{ij}$   (4)
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01}(\text{Time-Invariant}) + u_{0j}$   (5)
Level 2: $\beta_{1j} = \gamma_{10} + \gamma_{11}(\text{Time-Invariant}) + u_{1j}$   (6)


The dependent variable in the level-1 equation (Equation 4) is person j’s score at
time i, which is modeled as a function of person j’s intercept ($\beta_{0j}$, which can represent
the person’s mean with proper centering) and any time-varying (i.e., variables measured
at the daily level, whether or not a lag is introduced) predictors ($\beta_{1j}$), as well as random
error ($r_{ij}$). The level-2 equations (in this case, two equations, given the presence of an
intercept and single predictor in the level-1 equation) model level-1 parameters from each
individual ($\beta_{0j}$ and $\beta_{1j}$) as functions of their intercepts ($\gamma_{00}$ and $\gamma_{10}$, which can represent the
average values across individuals with proper centering), time-invariant (i.e., measured
at the level of the person, with the assumption of no change over the course of the study)
predictors ($\gamma_{01}$ and $\gamma_{11}$), and residual between-person variability in intercept and slope ($u_{0j}$
and $u_{1j}$).

To provide context for these equations, we might imagine that the dependent vari- 
able in the level-1 equation (4) is a daily measure of depression administered on day i 
for child j. A time-varying predictor might be a daily measure of the number of times 
the child was victimized by peers (e.g., hit, pushed, teased, excluded) that day. The
level-2 equations contain each child j’s intercept (Equation 5, level of depression when not 
victimized; alternatively, the intercept could represent the level of depression at average 
levels of victimization if victimization is centered) and slope (Equation 6, association of 
daily victimization predicting daily depression). To these equations, we can add level-2, 
or time-invariant, predictors. For example, including gender as a time-invariant predictor 
would provide information about whether girls or boys have higher levels of depression
(Equation 5), as well as whether girls or boys have a stronger connection between the
experience of victimization and depression (i.e., gender moderation; Equation 6).
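
One way to see how Equations 4 through 6 fit together is to generate data from them. The sketch below simulates daily depression scores for the victimization and gender example; it does not estimate a multilevel model. All parameter values, variance terms, sample sizes, and names are hypothetical and chosen only to make the structure visible.

```python
import random

def simulate_two_level(n_children=50, n_days=14, seed=1,
                       g00=2.0, g01=0.5, g10=0.3, g11=0.2):
    """Generate (child, day, girl, victimization, depression) records from Equations 4-6."""
    rng = random.Random(seed)
    records = []
    for j in range(n_children):
        girl = rng.choice([0, 1])                        # time-invariant (level-2) predictor
        b0j = g00 + g01 * girl + rng.gauss(0, 0.5)       # Equation 5: child j's intercept
        b1j = g10 + g11 * girl + rng.gauss(0, 0.1)       # Equation 6: child j's slope
        for i in range(n_days):
            victim = rng.randint(0, 4)                   # time-varying (level-1) predictor
            dep = b0j + b1j * victim + rng.gauss(0, 1)   # Equation 4: score on day i
            records.append((j, i, girl, victim, dep))
    return records

records = simulate_two_level()
print(len(records), records[0])
```

In this setup the hypothetical g01 shifts the typical depression level for girls relative to boys, whereas g11 changes how strongly daily victimization relates to daily depression, which is the moderation described above.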

Traditional MLM approaches have been applied to two of the mediation situations 
summarized in Table 26.1: 1-1-1 and 2-1-1 mediation models. In these models, one fits 
two separate multilevel equations of (1) X predicting M and (2) X and M predicting Y. 
The reason that MLM can be used for only two of the seven possible multilevel mediation 
models is that it is necessary for the dependent variables in both equations (i.e., M and 
Y) to be measured at the time-varying level (i.e., level 1). I next describe how MLM can 
be used in these two situations, and point out limitations of these analyses even in this 
small subset of situations. To illustrate both situations, I use the hypothetical model that 
children’s submissive behavior with peers (X) predicts their experiences of victimization 
from peers (M), which in turn predicts children’s level of depression (Y). 


MLM for 1-1-1 Mediation Models


If X, M, and Y are all measured at the time-varying level (i.e., level 1), then two sets of
multilevel equations are typically fit to evaluate mediation. The first set predicts the mediator
from the independent variable (i.e., the a path in Figure 26.1):


Level 1: $M_{ij} = \beta_{0j} + \beta_{1j}(X) + r_{ij}$   (7)
Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$   (8)
Level 2: $\beta_{1j} = \gamma_{10} + u_{1j}$   (9)


In this set of equations, the time-varying mediator (M) is predicted by the time-varying
independent variable (X). This prediction can either be fixed equal across individuals
(with $u_{1j}$ in Equation 9 set to 0) or be allowed to vary randomly across individuals
(Bauer, Preacher, & Gil, 2006; Kenny, Korchmaros, & Bolger, 2003).

The second set of equations estimates the path of the time-varying mediator (M) pre- 
dicting the time-varying dependent variable (Y), while controlling for the time-varying 
independent variable (X) (paths b and c’, respectively, in Figure 26.1): 


Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j}(X) + \beta_{2j}(M) + r_{ij}$   (10)
Level 2: $\beta_{0j} = \gamma_{00} + u_{0j}$   (11)
Level 2: $\beta_{1j} = \gamma_{10} + u_{1j}$   (12)
Level 2: $\beta_{2j} = \gamma_{20} + u_{2j}$   (13)


After fitting these two sets of models, the mediation effect is estimated as the product
of $\gamma_{10}$ from Equation 9 and $\gamma_{20}$ from Equation 13.

To illustrate, imagine that a developmental scientist measures children’s submissive 
behavior (X), victimization (M), and depression (Y) at the daily level across multiple days. 
Assuming that these variables each exhibit within-person variability, this would repre- 
sent a 1-1-1 design. Equation 7 would use daily measures of a child’s submissive behavior 
to predict the child’s receipt of victimization on that same day (as described below, it is 
also possible to introduce time lags into these models). Equation 8 would estimate children’s
victimization (where X = 0, which could be when they enact no submissive behavior
in a raw metric or on days of average submissive behavior if X is centered) as a function
of a mean and randomly distributed individual differences. Equation 9 would model
the association between daily submissive behavior and victimization in terms of an average
and random between-child differences. Equation 10 would predict children’s daily depression
from both their submissive behavior and victimization on that day. Equation 11 would
model the average and individual differences in depression when both submissive behavior
and victimization scores equal zero. Finally, Equations 12 and 13 model the mean and
individual differences in the submissive behavior → depression and the victimization →
depression associations.


MLM for 2-1-1 Mediation Models 


In the 2-1-1 situation, X is a time-invariant (i.e., level 2) variable, whereas M and Y are 
time-varying (i.e., level 1). Here, the mediational model is estimated within an MLM 
framework in a way similar to the 1-1-1 situation, except X is now a level-2 predictor. The 
first set of equations of X predicting M (path a in Figure 26.1) is: 


Level 1: $M_{ij} = \beta_{0j} + r_{ij}$   (14)
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01}(X) + u_{0j}$   (15)


The second set of equations provides estimates of M and X predicting Y:


Level 1: $Y_{ij} = \beta_{0j} + \beta_{1j}(M) + r_{ij}$   (16)
Level 2: $\beta_{0j} = \gamma_{00} + \gamma_{01}(X) + u_{0j}$   (17)
Level 2: $\beta_{1j} = \gamma_{10} + u_{1j}$   (18)


As with Equation 9, you can constrain path b (M predicting Y) equal across people
(by fixing $u_{1j}$ to 0) or allow it to vary randomly. After fitting these two multilevel models,
the mediation effect is estimated as the product of $\gamma_{01}$ from Equation 15 and $\gamma_{10}$ from
Equation 18.

Although this design and its equations differ from those of the 1-1-1 design, it is plausible
that a developmental scientist might also use this design to study a model of submissive
behavior → victimization → depression. In this model, the independent variable
(X) of submissive behavior is measured not at the daily (i.e., level 1, time-varying) level,
but instead as a time-invariant state. For instance, the scientist may measure submissive
behavior at the initial point only under the assumption that submissiveness is a trait-like 
personality feature. The implication of this design decision is that submissive behavior is 
not used as a predictor of daily victimization (Equation 14) or depression (Equation 16), 
but instead predicts individual differences in victimization (Equation 15) and depression 
(Equation 17) averaged across days. 


Limitations of the MLM Approach to Evaluating Multilevel Mediation 


Although the MLM approach to evaluating mediation has been used in numerous studies 
(for a review, see Chapter 9 in MacKinnon, 2008), there are three key limitations (for a 
more complete elaboration, see Preacher et al., 2010). 

First, the results of these models are frequently misinterpreted. It is important to 
keep in mind that time-invariant predictors (X or M) can never predict across-time varia- 
tion in a time-varying outcome (M or Y); instead, time-invariant predictors can predict 
some composite (e.g., the average) of time-varying outcomes at the level of the individual. 
For instance, a relatively stable trait within a person (e.g., gender, personality) cannot 
predict whether that person is high or low on a time-varying variable on a given day (e.g., 
that the person will be happy today and sad the next), only the typical levels of that vari- 
able across all days studied (e.g., that the person will be happy on most days and sad on 
few). 
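
This first point can be checked numerically. The following is a minimal sketch with made-up data (the person labels, ratings, and trait scores are all hypothetical): a trait that is constant within each person has zero covariance with that person's day-to-day deviations, because those deviations sum to zero within each person, but it can covary with the person's average level.

```python
from statistics import mean

def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: daily happiness ratings for three people,
# plus one time-invariant trait score per person.
daily_happiness = {
    "p1": [4.0, 6.0, 5.0, 7.0],
    "p2": [2.0, 3.0, 1.0, 2.0],
    "p3": [8.0, 7.0, 9.0, 8.0],
}
trait = {"p1": 0.5, "p2": -1.0, "p3": 1.5}

trait_long, deviations, person_means = [], [], []
for person, scores in daily_happiness.items():
    m = mean(scores)
    for s in scores:
        trait_long.append(trait[person])  # same value on every day
        deviations.append(s - m)          # day-to-day (within-person) part
        person_means.append(m)            # typical level (between-person part)

cov_within = covariance(trait_long, deviations)
cov_between = covariance(trait_long, person_means)
print(f"trait vs. daily deviations: {cov_within:.3f}")
print(f"trait vs. person means:     {cov_between:.3f}")
```

The first covariance is zero apart from floating-point rounding, whatever trait values are used; only the second can be nonzero.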

Second, the MLM accurately separates variability at the daily (level 1) and individual 
(level 2) levels of the outcome variables (i.e., Y in Equations 10-13 and 16-18; M in Equations
7-9 and 14 and 15). However, it fails to separate variability in the time-varying
predictors (i.e., X in Equations 7 and 10; M in Equations 10 and 16) into daily and 
individual levels. Therefore, time-varying predictors (both X and M in the 1-1-1 model; 
M in the 2-1-1 model) in the multilevel equations confound daily (level 1) and individual 
(level 2) variability. For instance, in the hypothetical example in which victimization (M) 
is conceptualized as a mediator between submissive behavior (X) and depression (Y), 
the daily measure of victimization appears only in the level-1 equations; between-person 
variability in victimization is not used as a predictor of between-person variability in 
depression. This confounding is problematic because it is likely that variables measured 
at the daily level also have interindividual variability in nearly all social science data (e.g., 
even if victimization is measured at the daily level, there are likely individual differences, 
with some children being more victimized than others on a consistent basis). 

The third limitation of the MLM approach is that it can be used only with a small 
portion of the possible mediational questions. As summarized in Table 26.1, there are 
seven potential multilevel mediational designs, and the MLM approach can be used only 
(even if imperfectly) for two of these. Preacher and colleagues (2010) proposed MSEM as 
a more general approach to evaluating multilevel mediation, which I turn to next. 


Multilevel Structural Equation Modeling Approach 


MSEM, a relatively recently developed analytic approach, is a flexible tool for evaluat- 
ing multilevel mediational models. Because I do not assume that all readers are familiar 
with traditional structural equation modeling (SEM), I first provide a brief overview of 
SEM and how it is used to evaluate mediation. I then describe MSEM in the context
of multilevel mediation analysis. The (matrix algebraic) equations underlying SEM are 
somewhat daunting to newcomers, so I describe these approaches using figures (path dia- 
grams) rather than equations. Furthermore, my goal is not to provide a complete “how 
to” guide to performing MSEM mediation analyses (that would take a whole book by 
itself); instead, my goal is that the reader appreciate the flexibility of this approach to see 
the value of multilevel mediation analyses. 


Structural Equation Modeling 


SEM uses observed variances and covariances among variables to estimate parameters 
within an a priori specified model. Such models can be thought of as the simultaneous 
analysis of two analytic issues: measurement and predictive models. The measurement 
portion of SEM estimates the relations of measured (i.e., manifest) variables in data with 
an unobserved (i.e., latent) construct. For instance, a researcher might acquire ratings 
on the extent to which participants feel sad, have diminished interest in activities, and 
experience a loss of energy. Presumably, these three manifest variables are believed to 
serve as indicators of an underlying latent variable that might be called depression. The 
measurement portion of SEM evaluates the strength of association of each manifest variable with the
latent construct, represented as a factor loading, with the expectation that internally 
consistent scales will have high factor loadings across manifest variables. The variance 
in a manifest variable that is not shared with the other manifest variables serving as 
indicators of the latent variable is termed residual variance. SEM also provides indices of 
model fit that indicate how closely the model evaluated reproduces the data. In the mea- 
surement portion of SEM, poor model fit might occur if manifest variables have unmod- 
eled dual loadings with other constructs or residual variance (variance not correlated 
with the latent construct) that correlates with the residual variance of other manifest 
variables (e.g., if two manifest variables share a method variance not shared by other 
manifest variables). 

The second portion of SEM involves predictive models. The general logic of predic- 
tion in SEM is similar to that of traditional (i.e., manifest variable) regression, in that 
some variables serve as independent variables, predicting other variables that serve as 
dependent variables. SEM is advantageous over manifest variable regression, however, in 
that the associations among latent constructs are disattenuated for measurement unreli- 
ability. Measurement unreliability (i.e., less than perfect internal consistencies) attenuates 
bivariate associations, and can therefore have difficult-to-predict effects on regression 
parameters. (The regression coefficient of $X_1$ predicting Y will be reduced by attenuation
of the $X_1$ with Y correlation. But this regression coefficient might be inflated in a multiple
regression due to the attenuated association of $X_1$ with $X_2$ when $X_2$ is an important
covariate.) In SEM, the predictive relations of interest are most often specified among latent
variables. Because latent variables capture the shared—or reliable—variance from among 
the manifest variables serving as indicators (recall that unshared variance is modeled as 
residual variance), the predictive relations among them are not attenuated by unreliabil- 
ity. Therefore, the estimated predictive relations from SEM are more accurate than those 
from manifest variable analyses. 
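
To give a feel for the size of this attenuation, here is a minimal sketch based on the classical test theory formula (Spearman's correction for attenuation). The formula and the reliability values are assumptions brought in for illustration; SEM does not apply this formula directly but achieves a comparable result by modeling latent variables.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the true-score correlation
    and the reliabilities of the two measures."""
    return true_r * math.sqrt(rel_x * rel_y)

def disattenuated_r(observed_r, rel_x, rel_y):
    """Spearman's correction: estimated correlation between true scores."""
    return observed_r / math.sqrt(rel_x * rel_y)

# Hypothetical values: true correlation .50, reliabilities .70 and .80
observed = attenuated_r(true_r=0.50, rel_x=0.70, rel_y=0.80)
print("observed correlation: ", round(observed, 3))
print("corrected correlation:", round(disattenuated_r(observed, 0.70, 0.80), 3))
```

With these hypothetical reliabilities, a true correlation of .50 would be observed as roughly .37, which illustrates why associations among manifest variables understate associations among constructs.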

FIGURE 26.2. Structural equation modeling (SEM) representation of mediation. Rectangles are
manifest (observed) variables. Ovals are latent variables.


Figure 26.2 displays a pictorial representation (called a path diagram) of how mediation
is assessed in an SEM framework (see also Little, Card, Bovaird, Preacher, & Crandall,
2007; Little, Preacher, Selig, & Card, 2007). Notice that Figure 26.2 looks similar
to the conceptual Figure 26.1b. The key difference is that Figure 26.2 is explicit in defin- 
ing manifest variables (the rectangles) as indicators of latent variables (the ovals), and the 
mediational process being modeled among the latent variables. Rather than estimating 
two separate multiple regression analyses (Equations 1 and 2), SEM simultaneously estimates
all of the relevant mediational paths: a (path of latent variable X predicting latent
variable M), b (path of latent variable M predicting latent variable Y), and c’ (path of
latent variable X predicting latent variable Y) (as well as numerous other parameters; e.g.,
factor loadings and residual variances of the manifest variables). The mediation effect
can be readily calculated by multiplying the parameter estimates of the a and b paths. 
The standard error of this effect (for significance testing and/or confidence intervals) can 
be calculated from the standard errors of these two parameter estimates (using Equation 
3), but most SEM packages can easily provide the more preferred bootstrapped standard 
errors. 
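
The two ways of obtaining a standard error can be sketched as follows. This is a simplified illustration and not what an SEM package does internally: it uses manifest rather than latent variables, ordinary least squares rather than SEM estimation, simulated data with hypothetical path values, and the first-order delta-method (Sobel) formula, which I assume here for the analytic standard error. All function names are hypothetical.

```python
import math
import random

def sobel_se(a, se_a, b, se_b):
    """First-order delta-method (Sobel) standard error of the product a*b."""
    return math.sqrt(a**2 * se_b**2 + b**2 * se_a**2)

def _cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v))

def indirect_effect(x, m, y):
    """a*b from two OLS regressions: M on X, and Y on M controlling for X."""
    sxx, smm = _cross(x, x), _cross(m, m)
    sxm, sxy, smy = _cross(x, m), _cross(x, y), _cross(m, y)
    a = sxm / sxx
    b = (sxx * smy - sxm * sxy) / (sxx * smm - sxm**2)
    return a * b

def bootstrap_ci(x, m, y, reps=1000, alpha=0.05, seed=1):
    """Percentile bootstrap confidence interval for the indirect effect."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases
        estimates.append(indirect_effect([x[i] for i in idx],
                                         [m[i] for i in idx],
                                         [y[i] for i in idx]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * reps)]
    upper = estimates[int((1 - alpha / 2) * reps) - 1]
    return lower, upper

# Simulated data with hypothetical true paths a = 0.5, b = 0.4, c' = 0.1
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(200)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.4 * mi + 0.1 * xi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

print("indirect effect:", round(indirect_effect(x, m, y), 3))
print("95% bootstrap CI:", tuple(round(v, 3) for v in bootstrap_ci(x, m, y)))
print("Sobel SE for a = 0.5 (SE 0.07), b = 0.4 (SE 0.07):",
      round(sobel_se(0.5, 0.07, 0.4, 0.07), 4))
```

The bootstrap interval is usually preferred because the product of two estimates is not normally distributed, so an interval built symmetrically around the estimate from the analytic standard error can be inaccurate.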


Multilevel Structural Equation Modeling 


An important recent advancement in SEM is the development of improved methods of 
modeling multilevel data (for a relatively accessible overview, see Chapter 7 in Kaplan, 
2009). This MSEM approach decomposes variability in the manifest variables into level-1 
and level-2 components, and these multilevel variances and covariances are then used to 
estimate separate models at each level. In the context of studying daily lives, variability 
in manifest variables is separated into daily (level 1) variation and between-person (i.e., 
interindividual; level 2) variation. Across manifest variables, the covariations are simi- 
larly separated into daily and interindividual components. Thus, the “trick” of MSEM 
is that the level-1 (daily) variances and covariances are used to estimate a level-1 model, 
and the level-2 (interindividual) variances and covariances are used to estimate a level-2 
model. The separation of the level-1 and level-2 models allows us to evaluate mediational 
models at both the daily (level 1) and interindividual (level 2) levels, with potentially dif- 
ferent findings at each level. 
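
The logic of this decomposition can be sketched with observed person means. This is a deliberate simplification: MSEM treats the person-level components as latent rather than computing them as observed means, and the children, ratings, and function names below are hypothetical. The point is only that the same data can show different associations at the two levels.

```python
def mean(values):
    return sum(values) / len(values)

def decompose(data):
    """Split each person's daily scores into a person mean (level 2)
    and daily deviations from that mean (level 1)."""
    between, within = {}, {}
    for person, scores in data.items():
        m = mean(scores)
        between[person] = m
        within[person] = [s - m for s in scores]
    return between, within

def slope(xs, ys):
    """OLS slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical daily ratings for three children
submissive = {"c1": [1, 2, 3], "c2": [3, 4, 5], "c3": [5, 6, 7]}
victimized = {"c1": [3, 2, 1], "c2": [5, 4, 3], "c3": [7, 6, 5]}

sub_between, sub_within = decompose(submissive)
vic_between, vic_within = decompose(victimized)
children = list(submissive)

level2_slope = slope([sub_between[c] for c in children],
                     [vic_between[c] for c in children])
level1_slope = slope([d for c in children for d in sub_within[c]],
                     [d for c in children for d in vic_within[c]])

print("level-2 (between-child) slope:", level2_slope)  # 1.0
print("level-1 (daily) slope:", level1_slope)          # -1.0
```

In these contrived data, children who are more submissive on average are also more victimized on average, yet on days when a given child is more submissive than usual, that child is less victimized than usual.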
Figure 26.3 displays the MSEM approach to evaluating multilevel mediational models
(following Preacher et al., 2010). The boxes in the middle of this figure represent the
manifest variables. Returning to the hypothetical example described earlier, these mani- 
fest variables might be ratings of various aspects of submissive behavior (e.g., giving up 
toys), victimization (e.g., being teased), and depression (e.g., feeling sad). For now, we 
can assume that each is measured daily, so each manifest variable can have both daily 
and interindividual variability, and the daily and interindividual variances might covary 
across variables. The daily variance and covariance are used as information for the lower 
portion of Figure 26.3, which contains both a measurement and a predictive model. The 
daily (level 1) measurement model uses daily covariation in manifest variables to specify 
daily-level constructs $X_1$, $M_1$, and $Y_1$. The predictive model among these daily (level 1)
constructs specifies a mediational model, and the parameter estimates (specifically, $a_1$
and $b_1$, with the 1 subscript denoting level-1 estimates) are used to evaluate the daily
mediational effect. For instance, the $a_1$ path might represent the extent that daily levels
of submissive behavior predict daily experiences of victimization, and the $b_1$ path might
represent the extent that daily experiences of victimization predict daily depression. The
upper portion of Figure 26.3 uses interindividual variance and covariance among the
manifest variables in a parallel way: Interindividual covariations among indicators of
the same construct are used to specify individual-level constructs $X_2$, $M_2$, and $Y_2$, and the
predictive portion specifies a mediational model of individual differences among these
three constructs. For instance, $a_2$ might represent the degree that individual differences in
submissive behavior (averaged across time) predict individual differences in victimization
(averaged across time), and $b_2$ might represent the degree that individual differences in
victimization (averaged across time) predict individual differences in depression.


FIGURE 26.3. Multilevel structural equation modeling (MSEM) representation of mediation, as
illustrated by a 1-1-1 design.

The separation of daily (level 1) and interindividual (level 2) variance among all 
aspects of the mediational model (not just the dependent variables in certain equations, 
such as in MLM) already indicates substantial advantages of the MSEM approach to 
evaluating mediation. Moreover, the pictorial representation in Figure 26.3 helps to clar- 
ify interpretation by making explicit that mediational processes at the daily and indi- 
vidual levels are entirely separate (and that an individual-level construct cannot predict a 
daily-level construct, or vice versa). To illustrate this point using our hypothetical example:
It is possible that daily submissiveness leads to lower victimization, but consistent
patterns (i.e., individual differences) in submissiveness lead to higher victimization; these
different patterns would be estimated (as a negative value for $a_1$ and a positive value
for $a_2$) in MSEM. Perhaps the greatest advantage of the MSEM approach to evaluating
multilevel mediation, however, is that it allows us to evaluate any of the seven multilevel
situations in Table 26.1, each of which allows us to evaluate different level-1 and level-2
processes. The reason for this flexibility is, as mentioned, that the level-1 (daily) and 
level-2 (individual) models of MSEM do not need to be identical, or even to consist of the 
same numbers of manifest or latent variables. This means that if any of X, M, or Y are 
measured at individual rather than daily levels, it is still possible to evaluate mediation. 
For example, Figure 26.4 displays the model when the independent variable (X; e.g., sub- 
missive behavior) and mediator (M; e.g., victimization) are measured at the daily level, 
but the dependent variable (Y; e.g., depression) of interest is an individual-level measure; 
this 1-1-2 mediation model could not be evaluated in an MLM framework, but it is read- 
ily evaluated in the MSEM framework (allowing for a full test of individual-differences 
mediation, and for the evaluation of the submissive behavior → victimization association
at the daily level). 


Applications in the Study of Daily Lives 


To this point, I have presented the evaluation of multilevel mediation models in rather 
generic terms, without specifying the way that time is used. This generic presentation was 
intentional, because I want to make clear that multilevel mediation models can incorpo- 
rate a variety of ways of using time. 

For instance, the multilevel framework can accommodate time as growth curve 
models (see Nezlek, Chapter 20, this volume). With growth curve models, it is possible to 
parameterize nesting of time points within individuals using a latent growth curve model 
(e.g., Bollen & Curran, 2005; Duncan, Duncan, & Strycker, 2009; Preacher, Wichman,
MacCallum, & Briggs, 2008), thus not framing these models within the traditional mul- 
tilevel context. However, although the multilevel and latent growth curve representations 
are algebraically equivalent (e.g., Curran, 2003), there are times when using one approach
is simply more convenient than using the other (e.g., when time points are unbalanced).
Regardless of the way that growth curves are modeled, it may be meaningful to evaluate
questions of mediation among growth curve parameters (e.g., intercepts, linear slopes). In
other words, you might be interested in evaluating whether change in X predicts change
in M, which in turn predicts change in Y.


FIGURE 26.4. The flexibility of MSEM mediational models, as illustrated by a 1-1-2 design.

The general presentation of multilevel mediation in this chapter can also accommo- 
date lagged intraindividual models, as summarized in Figure 26.5. Given a large num- 
ber of days on which X, M, and Y are measured among multiple individuals, multilevel 
mediation models are especially promising. At the daily level (level 1), such models can 
evaluate whether intraindividual variability in X on one day predicts intraindividual vari- 
ability in M the next day, which in turn predicts intraindividual variability in Y on the 
following day. At the level of individual differences (level 2), the model can evaluate 
whether interindividual variability in X predicts interindividual variability in M, which 
then predicts interindividual variability in Y. Such models have two major advantages. 
First, they allow for more realistic modeling of the longitudinal time over which causal 
mediational processes necessarily occur (see Maxwell & Cole, 2007). Second, they allow 
for tests of competing mediational sequences. For example, the alternative hypothesis
that daily depression leads to experiences of victimization the next day, which in turn
leads to submissive behavior the following day, could be tested in this model.


FIGURE 26.5. Example of a lagged intraindividual mediation model. To reduce the complexity
of the figure, indicators are not shown. The top of the figure shows a mediational model of
interindividual differences. The bottom of the figure shows a lagged mediational model of intraindividual
differences, such that X predicts subsequent M, which in turn predicts subsequent Y. Gray arrows
are modeled but not of central interest in evaluating mediation.
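
Setting up the data for such a lagged model amounts to pairing each day's X with the next day's M and the following day's Y, separately for each person so that one person's last days are never paired with another person's first days. A minimal sketch follows; the function name, the use of None for a missed diary entry, and the example values are all assumptions made for illustration.

```python
def lagged_triplets(x, m, y, lag=1):
    """Pair X on day t with M on day t+lag and Y on day t+2*lag for one person.

    The three lists must be the same length and in day order. Triplets that
    contain a missing value (None) in any position are skipped.
    """
    rows = []
    for t in range(len(x) - 2 * lag):
        triple = (x[t], m[t + lag], y[t + 2 * lag])
        if None not in triple:
            rows.append(triple)
    return rows

# Hypothetical five-day diary for one child (day 3 of X was missed)
x = [2, 3, None, 1, 4]
m = [1, 5, 6, 7, 8]
y = [0, 9, 10, 11, 12]
print(lagged_triplets(x, m, y))  # [(2, 5, 10), (3, 6, 11)]
```

Reversing the roles of the three variables in the call yields the data for a competing sequence, such as depression predicting later victimization and then submissive behavior.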

Regardless of the specific way that time is handled, the multilevel mediation approach,
especially when analyzed using MSEM, is a flexible approach to modeling mediation 
among both time-varying and time-invariant variables. 


Challenges in Using MSEM to Evaluate Multilevel Mediation 


Despite the flexibility offered by MSEM in estimating multilevel mediation in the study of 
daily lives, several challenges remain. The first of these challenges is simply that MSEM 
requires specialized software and advanced knowledge of SEM. At the time this chapter 
was written, only the Mplus (Muthén & Muthén, 1998-2010) software package could 
estimate the type of multilevel structural equation models described in this chapter (other
programs can estimate multilevel structural models within the two-group representation
described by Selig, Card, & Little, 2008, but the limits of this approach are beyond the
scope of this chapter). In addition to specialized software, the MSEM approach requires
specialized analytic skills in fairly advanced SEM.

Other challenges of using MSEM for multilevel mediation are due to the novelty of 
this approach: Currently, many basic data-analytic issues relevant to the study of daily 
lives have not yet been addressed. For instance, moderator hypotheses are common 
(e.g., moderated mediation and/or mediated moderation; Preacher, Rucker, & Hayes, 
2007), yet complete approaches to evaluating moderation within MSEM have yet to be 
developed. Although categorical individual-level moderators may be evaluated through 
multigroup analyses (e.g., Little, Card, et al., 2007), methods of representing continu- 
ous variable interactions—whether with level-1 or level-2 moderators—have yet to be 
developed and are not straightforward. Also, the sample size requirements for evaluating 
mediation hypotheses using MSEM—in terms of both the number of days (level-1 units) 
and individuals (level-2 units)—have not been systematically considered (see Preacher et 
al., 2010). Considerations from both the traditional MLM and the mediation literatures 
are likely relevant, but until systematic studies are performed, it is probably necessary to 
conduct simulations to determine appropriate sample sizes when planning a study (see
Laurenceau & Bolger, Chapter 22, this volume). 
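
The general shape of such a planning simulation can be sketched as follows. This is a strongly simplified stand-in for a real MSEM power analysis: it considers only the daily (level 1) indirect effect, uses manifest variables, person-mean centers the data and fits pooled ordinary least squares regressions instead of fitting an MSEM, ignores the degrees of freedom lost to centering, and declares mediation when both paths are significant. The path values, sample sizes, and function names are hypothetical.

```python
import math
import random

def path_t_values(x, m, y):
    """t statistics for path a (M on X) and path b (Y on M, controlling for X)."""
    n = len(x)

    def centered(v):
        mu = sum(v) / n
        return [value - mu for value in v]

    x, m, y = centered(x), centered(m), centered(y)
    sxx = sum(p * p for p in x)
    smm = sum(p * p for p in m)
    syy = sum(p * p for p in y)
    sxm = sum(p * q for p, q in zip(x, m))
    sxy = sum(p * q for p, q in zip(x, y))
    smy = sum(p * q for p, q in zip(m, y))

    a = sxm / sxx
    se_a = math.sqrt((smm - a * sxm) / (n - 2) / sxx)

    det = sxx * smm - sxm ** 2
    b = (sxx * smy - sxm * sxy) / det
    c = (smm * sxy - sxm * smy) / det
    resid_var = (syy - b * smy - c * sxy) / (n - 3)
    se_b = math.sqrt(resid_var * sxx / det)
    return a / se_a, b / se_b

def within_person_power(n_persons, n_days, a, b, reps=200, seed=0):
    """Share of simulated studies in which both paths have |t| > 1.96."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs, ms, ys = [], [], []
        for _ in range(n_persons):
            # Stable person-level differences are left out of the simulation
            # because person-mean centering would remove them anyway.
            x = [rng.gauss(0, 1) for _ in range(n_days)]
            m = [a * xv + rng.gauss(0, 1) for xv in x]
            y = [b * mv + rng.gauss(0, 1) for mv in m]
            for series, pooled in ((x, xs), (m, ms), (y, ys)):
                mu = sum(series) / n_days
                pooled.extend(v - mu for v in series)
        t_a, t_b = path_t_values(xs, ms, ys)
        hits += abs(t_a) > 1.96 and abs(t_b) > 1.96
    return hits / reps

for n_persons in (10, 30, 60):
    print(n_persons, "persons x 7 days:",
          within_person_power(n_persons, 7, a=0.2, b=0.2))
```

In an actual study, the data-generating model and the analysis model inside the loop would be replaced by the intended MSEM, and the number of persons and days would be varied until the estimated power is acceptable.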


Conclusion 


Despite these challenges, MSEM is a flexible and valuable tool for evaluating multilevel 
mediation in the study of daily lives. In this chapter, I have argued that mediation models 
are useful in articulating and evaluating processes of influence. Although single-level 
mediation is well studied and commonly used, there has been substantially less effort to 
study mediation in the multilevel context. Part of the problem might have been the limita- 
tions of MLM in evaluating multilevel mediation, including the potential for misinterpre- 
tation, the confounding of level-1 and level-2 variance among the predictors, and the lim- 
ited range of multilevel mediation models that can even be evaluated using this approach. 
MSEM evaluation of mediation models—a recent addition to our data-analytic tools 
introduced by Preacher and colleagues (2010)—overcomes many of these limitations but 
comes with some additional challenges that need to be addressed in future research. Nev- 
ertheless, the recent expansion in possibilities of evaluating multilevel mediation using 
MSEM offers great promise in evaluating process hypotheses in the study of daily lives. 
Emotions are variable; they vary over time and over situations, and this variability is
manifest both within and between persons. Indeed, Mesquita (2010) argues that we
should replace the term emotion with the term emoting to capture the changing nature 
and contextual sensitivity of this phenomenon. As pointed out by Frijda (2009), in some 
languages (e.g., Russian), emotions are often designated by verbs instead of nouns to com- 
municate the active, reactive, and changing nature of emoting. Sources of emotional vari- 
ability are also well described by Feldman Barrett (2010), who reviews several theoretical 
perspectives on emotional variability; for example, appraisals produce variability, social 
and relationship factors produce variability, and underlying cognitive (e.g., attentional, 
perceptual) and physiological processes produce variability. To this list of perspectives on 
emotional change, we would add a functional perspective; emotions, moods, and feeling 
states provide constant input into the psychological system about important parameters 
of personal and social functioning (Morris, 1992). Emotions are thus both an output and 
an input into the psychological system, and this complexity has been incorporated into 
some models of emotion. For example, Scherer (2009) proposes a dynamical architecture 
model of the component input and output functions of emotion. Larsen (2000) provides 
a cybernetic theory of affect variability and regulation, arguing that affective changes 
result from multiple perturbing processes, but that they exist within a homeostatic system 
designed to keep affective variability within a range of a set point or desired state (i.e., 
Augustine, Hemenover, Larsen, & Shulman, 2010). 

Even though there are many theories and models about the functions of emotions, 
and many perspectives on the sources of emotional variability, all start with the fact 
that emotions fluctuate. This fact is obvious in observations of everyday life, where we 
witness fluctuations in affect in ourselves and in those around us. The ebb and flow of 
affect colors the daily lives of everyone. In psychological studies of emotion, however, 
the modal design is optimized for the study of static variables. In most psychological 
research, dependent variables are measured on one occasion, and conclusions are derived 
about what caused cross-subject variability on that occasion (e.g., experimental conditions)
or what other variables are correlated with that dependent variable. Some studies
assess emotion twice, before and after some intervening manipulation (the independent
variable), and the focus is on a one-time shift or change. Entry-level courses in graduate 
statistics typically teach students how to analyze mostly static variables for variance due 
to experimental conditions, or shared variability between variables measured on one 
occasion. Training in conceptualizing and analyzing variables that fluctuate over time 
has not typically been a part of the standard graduate curriculum in most psychology 
PhD programs. 

This has been changing in recent years, however, with the advent of more dynamic 
models of affect and the development of statistical methods for capturing, quantifying, 
and extracting meaning from patterns of temporal variability. We have been writing 
about process approaches that capture temporal dynamics for some time (e.g., Larsen, 
1987, 1989, 1990, 2003, 2007; Larsen, Augustine, & Prizmic, 2009), focusing mainly on 
the conceptual possibilities that come with making intensive observations of people over 
time, as in studies of daily life. Developments in the area of intensive time sampling (e.g., 
Stone & Shiffman, 2007) point to more dynamic research designs, as well as approaches 
to collecting and analyzing time-based data (Kubiak & Krog, Chapter 7, this volume), 
that allow the researcher to capture and parse temporal variability. Several authors in the 
present volume also highlight the theoretical gains (e.g., Reis, Chapter 1, this volume) 
and methodological and analytic (e.g., Nezlek, Chapter 20, this volume) possibilities that 
come with intensive time sampling. This intensive time sampling, or process approach, is 
well suited to the study of emotional variability and will be useful for investigating some 
of the theoretical perspectives on emotion mentioned earlier, as well as existing and yet- 
to-be-thought-of questions about patterns of emotional change. 

Because the intensive observation of people over time results in a formidable amount 
of data, most early researchers restricted its use to idiographic analyses. For example, 
Barker and Wright (1951) conducted a detailed observational study of microprocesses 
that unfold within an individual over time. They focused on a day in the life of 7-year-old 
Raymond from the moment he woke up until the moment he went to bed on an April day 
in 1949, and recorded his every act. Fortunately, with the rise of intensive time sampling 
(i.e., daily diary, experience sampling) and modern computer-based data management 
and analysis, research designs that involve intensive time sampling of large samples and 
the analysis of within-person processes have undergone a resurgence in the psychological 
literature (e.g., Conner, Feldman Barrett, Tugade, & Tennen, 2007). 

By intensively sampling data over time, we can study not just one boy’s day, but 
many people’s days, weeks, months, or even years (e.g., Nuetzel, Larsen, & Prizmic, 
2007). Moreover, we can measure aspects of self-report (affect, symptoms, activities), as 
well as other behaviors, such as capturing naturalistic speech samples (Mehl & Robbins, 
Chapter 10, this volume) or even online performance on reaction time tasks. Using these 
sampling methods, and analyzing both within- and between-person variability, we can 
address questions about emotion that are approachable only if both the person and time 
are included in our dataset. Indeed, including the temporal dimension in the study of emo- 
tion is necessitated by the construct itself. Emotions are thought to arise due to particular 
antecedent events (i.e., they have a referent) and consist of an experience that is intense 
but short-lived (Davidson, 1994). Thus, two of the defining features of an emotional 
experience are subject to temporal processes: intensity and duration (Verduyn, Delvaux, 
Van Coillie, Tuerlinckx, & Van Mechelen, 2009). Emotions are not static; they vary in 
response to any number of events. A temporal approach is required to examine changes
in the time parameters of emotional experience. Even in examining mood states, which
are longer in duration and do not necessarily have a referent, one must consider temporal 
processes. Although a mood may last days or even weeks, the intensity of a mood state 
varies over time. Thus, a primary component of an emotion or mood is change itself, and 
these changes must be taken into account to truly understand emotional processes. 


The Process Approach: A Brief Description 


As we have been writing about it over the years (e.g., Larsen, 1987, 1989, 1990, 2003, 
2007; Larsen et al., 2009), the process approach is really a way of thinking about tem- 
poral data that allows the researcher to formulate unique questions about some phenom- 
enon. Some researchers go to the trouble of gathering daily or intensive time-sampling 
data, then simply collapse over time to get means on particular variables for each person. 
This is fine, if one wants to get a good, reliable estimate of the expected value on some 
variable, but it throws away all the temporal information that is potentially available for 
analysis. For example, what if one wanted to know how changeable people are on some 
variable? In this case, a within-subject standard deviation would be a good parameter, 
one that incorporates the temporal information, in that it averages deviations from the 
person’s mean over time. However, one might also wish to consider other within-person
distributional parameters, such as skewness or kurtosis.
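
As a minimal sketch of what such within-person parameters look like in practice, the following computes the mean, standard deviation, skewness, and excess kurtosis of each person's scores over time. The two participants and their ratings are hypothetical, the population (divide by n) formulas are one choice among several, and the function assumes the person's scores are not all identical.

```python
import math

def within_person_moments(scores):
    """Mean, SD, skewness, and excess kurtosis of one person's scores over time."""
    n = len(scores)
    m = sum(scores) / n
    devs = [s - m for s in scores]
    var = sum(d ** 2 for d in devs) / n  # population variance
    sd = math.sqrt(var)
    skew = sum(d ** 3 for d in devs) / n / sd ** 3
    kurt = sum(d ** 4 for d in devs) / n / sd ** 4 - 3.0
    return {"mean": m, "sd": sd, "skewness": skew, "kurtosis": kurt}

# Hypothetical daily mood ratings for two people over 10 days
daily_mood = {
    "steady":   [5, 5, 6, 5, 4, 5, 5, 6, 4, 5],
    "volatile": [2, 8, 3, 9, 1, 7, 5, 9, 2, 4],
}
for person, scores in daily_mood.items():
    stats = within_person_moments(scores)
    print(person, {k: round(v, 2) for k, v in stats.items()})
```

Both hypothetical participants have the same mean, so collapsing over time would make them look identical; only the within-person parameters distinguish them.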

So this is the first step in applying the process approach to daily data, to inquire 
about how some psychological process will be manifest in daily data, and to think of a 
quantitative way to model that particular process. We give several examples in this chap- 
ter of emotion processes modeled with various within-subject parameters. But essentially, 
if one is interested in change, variability, duration, recovery rate, rise time, co-occurrence, 
lead-lag relationships, forecasting, cycles or rhythms, or other kinds of temporal pat- 
terns, then a particular parameter can be proposed as a way to model that process for 
each participant. Sometimes this modeling can be done with hierarchical linear modeling 
(HLM) techniques, especially if the model is a simple distributional parameter or linear 
correlation. But any kind of modeling can be done on each subject’s data provided that 
the temporal dimension is sampled frequently enough to result in stable estimates of 
model parameters. 

The second step in the process approach, while not absolutely necessary, often adds 
scientific meaning to the question being considered. Having modeled some particular 
process for each subject in the first step, the second step is to inquire about meaningful 
group or individual differences in that process. For example, we might ask to what degree 
daily emotions are linked to the events in our daily lives, and calculate within-subject cor- 
relations as parameters that indicate this linkage for each subject. Then we might analyze 
these for meaningful variability across people. Who are the subjects whose emotions are 
most reactive to life events, and in what ways are they different from people whose daily 
emotions are more free-running and less responsive to the objective circumstances of 
their lives? 
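
The two steps can be sketched in a few lines. In this illustration the diary values, the trait scores, and the variable names are all hypothetical, and a real application would involve many more days and participants: step 1 computes a within-subject correlation between daily events and daily mood for each person, and step 2 asks whether those correlations covary with an individual-difference variable.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical five-day diaries: daily event pleasantness and daily mood,
# plus one trait score per person
diary = {
    "p1": {"events": [1, 2, 3, 4, 5], "mood": [2, 3, 4, 5, 6], "trait": 4.5},
    "p2": {"events": [1, 2, 3, 4, 5], "mood": [3, 2, 4, 5, 4], "trait": 3.5},
    "p3": {"events": [1, 2, 3, 4, 5], "mood": [4, 3, 4, 3, 4], "trait": 2.0},
    "p4": {"events": [1, 2, 3, 4, 5], "mood": [4, 4, 3, 4, 3], "trait": 1.5},
}

# Step 1: model the process for each subject (event-mood linkage)
reactivity = {p: pearson_r(d["events"], d["mood"]) for p, d in diary.items()}

# Step 2: look for individual differences in that process
people = list(diary)
r_trait = pearson_r([reactivity[p] for p in people],
                    [diary[p]["trait"] for p in people])

for p in people:
    print(p, "within-subject r =", round(reactivity[p], 2))
print("correlation of reactivity with trait:", round(r_trait, 2))
```

Any other within-subject parameter, such as a standard deviation or a lagged regression slope, could take the place of the correlation in step 1 without changing the logic of step 2.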

In the remainder of this chapter we give a number of examples of the process approach 
to the study of daily emotion. Our aim is not to be inclusive, encompassing, or even repre- 
sentative. Instead, our aim is to display several different ways that this approach has been 
used to address novel questions in the emotion area. Ultimately, we wanted to share our 
enthusiasm for this approach with the reader by highlighting a few successful examples
of its application to innovative questions. 


Examples of the Process Approach to Daily Emotion 


In this section we highlight several examples of the process approach in the study of daily 
emotions from our own and others’ work. Our examples offer insight into a number of
temporal processes in emotion, such as duration, within-person prediction of occurrence, 
pattern fit, complexity, within-person variance, and moderated frequency characteristics. 
Our intent is to provide a sampling of the kinds of unique questions that can be addressed 
when emotions are assessed intensively over time, and the data-analytic strategy capital- 
izes on the temporal dimension inherent in such data. 


Duration and Pattern in Daily Emotion 


One of the most basic questions when considering the temporal dimension of emotions 
might be how long emotion episodes typically last. More important, what stimuli serve to 
prolong emotions, and what are the consequences of a prolonged emotional episode? Ver- 
duyn and colleagues (2009) examined these issues for five different emotional states (fear, 
anger, sadness, gratitude, and joy) in a series of two intensive time-sampling studies that 
utilized once-daily sampling and a survival analysis. Their results indicated that while 
some emotions tend to be shorter in duration (around 10 minutes: fear, anger, gratitude), 
others (around 20 minutes: sadness and joy) last longer. While episodes that last longer 
than 10-15 minutes tend to decline rapidly within 30 minutes, emotional episodes that 
survive the first 30-minute period after occurrence tend to transform into more endur- 
ing, mood-like states. While these may be considered average trends for emotional epi- 
sodes, any reappearance (physically or cognitively) of the eliciting stimulus can lead to an 
increase in the duration of that episode. Moreover, if the eliciting stimulus is particularly 
important to the individual, or if that individual experiences a relatively intense initial 
reaction to the stimulus, then that emotional episode will also last longer. Finally, while 
results were mixed and somewhat inconsistent across studies, personality does seem to 
play a role in how long certain emotional episodes endure (see also Hemenover, 2003; 
Suls, Green, & Hillis, 1998). Thus, specific emotions vary in their duration, and the 
degree to which an emotional episode endures may be altered by qualities of the eliciting 
stimulus, reappearance of the eliciting stimulus, or trait patterns of affective responding 
(i.e., personality). Not only do specific emotions display unique temporal patterns, but 
broader, aggregated measures of affective experience also display unique patterns
over time. 

As another example of duration research, Larsen and Kasimatis (1991) used daily 
sampling to study the duration of common health symptoms, as well as personality and 
emotion predictors of symptom duration. While many researchers are interested in the 
occurrence of illnesses, Larsen and Kasimatis were interested in how long those illnesses 
typically last, once a person is sick with a minor illness, and whether there are individual 
differences in the duration of common illnesses. They hypothesized that personality fac- 
tors might play a negligible role in the onset of symptoms but a larger role in determining 
who is likely to recover faster from minor illnesses once they do occur.

To study the duration of daily illnesses, Larsen and Kasimatis (1991) had participants
report their symptoms (factored into distress, aches, gastrointestinal issues, and
upper respiratory issues) and emotions three times a day for 60 consecutive days. To 
quantify the duration of symptoms, Larsen and Kasimatis calculated an autocorrelation 
function. This function basically calculates the correlation between a symptom report
at time t and symptom reports at times t + 1, t + 2, t + 3, and so on, to see how far into
the future one can significantly forecast the presence of symptoms given the onset of the
symptom at time t. Autocorrelations were computed separately for each subject for each
of the four symptom factors to model individual differences in the duration of symptoms 
over time. For each subject and symptom factor, symptom duration was measured by 
determining the maximum lag at which the autocorrelation remained significantly dif- 
ferent from zero. 
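
A minimal sketch of this duration measure follows. It assumes a large-sample significance bound of 1.96 divided by the square root of the number of reports, which may differ from the criterion used in the original study, and the symptom series is invented for illustration. The names `autocorr` and `duration_in_lags` are hypothetical.

```python
import math

def autocorr(x, lag):
    """Lag-k autocorrelation of a single subject's series."""
    n = len(x)
    m = sum(x) / n
    denom = sum((v - m) ** 2 for v in x)
    num = sum((x[t] - m) * (x[t + lag] - m) for t in range(n - lag))
    return num / denom

def duration_in_lags(x, max_lag=10):
    """Largest lag up to which every autocorrelation stays significant."""
    bound = 1.96 / math.sqrt(len(x))  # assumed large-sample criterion
    last = 0
    for lag in range(1, max_lag + 1):
        if abs(autocorr(x, lag)) <= bound:
            break
        last = lag
    return last

REPORTS_PER_DAY = 3  # three reports a day, as in the study described above

# Hypothetical subject: symptom bouts lasting six reports, 180 reports in all.
persistent = ([1] * 6 + [0] * 12) * 10
lags = duration_in_lags(persistent)
days = lags / REPORTS_PER_DAY  # convert lags to days
```

Running the same function on every subject and symptom factor yields one duration score per person, which can then be correlated with personality measures.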

In terms of duration, the upper respiratory factor had the longest average duration, 
with the average autocorrelation remaining significant to over 5 lags (1.67 days) forward 
in time. This finding does not mean that respiratory infections last only 2 days. Rather, 
this lag variable means that respiratory symptoms persist in a probabilistic way that 
allows significant predictability almost 2 days into the future on average. By contrast, the 
shortest duration symptom was found for the gastrointestinal factor (1.58 lags, or 6-9
hours). It is also noteworthy that the distress symptom factor had a very short duration, 
with significant autocorrelation slightly less than 2 lags into the future. The symptoms 
reported in this factor probably have to do with intense but temporary negative affect 
associated with acute stress, which is common enough among college students to produce 
a distinct symptom factor. 

In terms of personality correlates of symptom duration, a pattern emerged such that, 
with the exception of respiratory symptoms, the duration of all other symptoms corre- 
lated significantly (and negatively) with anger control. Individuals who react aggressively 
to frustration, and who easily become hostile and irritable, experience longer duration 
symptoms when they become ill. Neuroticism showed no correlations with duration, even 
though it is often related to self-reported symptoms. Interestingly, the respiratory dura- 
tion parameter did not correlate with any personality variable, even though the duration 
scores on this symptom factor had the largest range of between-subjects scores of all the 
parameters. 

We include this example of illness symptoms here for two reasons. While not exactly 
a measure of affect, assessments of symptoms are similar to assessments of emotion, in 
that they are subjective reports of inner experiences that vary over time. In addition, this 
example of symptom duration clearly illustrates the process approach to (1) capturing a 
within-subject temporal parameter (in this case, the autocorrelation), then (2) examining 
data for meaningful individual differences in that parameter. In our next example, we 
look at the covariation between daily emotion and daily symptoms over time. 


Covariation between Daily Emotion and Other Variables in Daily Life 


In their study of daily symptoms, Larsen and Kasimatis (1991) were also interested in 
assessing the emotional impact of daily health symptoms, and whether some people are 
more or less affected than others by minor illnesses. To assess the linkage between daily 
mood and daily symptoms, they regressed daily mood on each symptom score over the 
occasions of observation for each subject. In this time series correlation between mood
and symptoms, they controlled for the previous three mood scores. This has the effect of
removing any autocorrelation in mood that might contribute to the mood-symptom cor- 
relation, as in the following equation: 


$$\text{Mood}_t = a + B_1\,\text{Mood}_{t-1} + B_2\,\text{Mood}_{t-2} + B_3\,\text{Mood}_{t-3} + B_4\,\text{Symptom}_t$$


If a subject has a large $B_4$ parameter, it means that his or her negative emotions covary
strongly with the presence of symptoms, even after controlling for autocorrelation in the 
mood measure. Such a subject displays affective changes that co-occur (becoming more 
negative) with changes in health status. 
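
The regression can be sketched for a single subject as follows. This is an illustration only: the data are simulated with known coefficients so that the recovered values can be checked, the ordinary least squares routine is written out by hand to keep the example self-contained, and the function names are hypothetical rather than taken from the original analysis.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients; each row already includes the constant."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def mood_symptom_linkage(mood, symptom, n_lags=3):
    """Regress mood at t on its previous n_lags values and symptom at t."""
    rows, y = [], []
    for t in range(n_lags, len(mood)):
        lagged = [mood[t - k] for k in range(1, n_lags + 1)]
        rows.append([1.0] + lagged + [symptom[t]])
        y.append(mood[t])
    return ols(rows, y)  # [a, B1, B2, B3, B4]

# Simulated subject (hypothetical): 180 reports, true linkage B4 = 2.0.
rng = random.Random(1)
symptom = [rng.random() for _ in range(180)]
mood = [3.0, 3.0, 3.0]
for t in range(3, 180):
    mood.append(1.0 + 0.5 * mood[t - 1] - 0.2 * mood[t - 2]
                + 0.1 * mood[t - 3] + 2.0 * symptom[t])

a, b1, b2, b3, b4 = mood_symptom_linkage(mood, symptom)
```

Because the simulated series contains no noise, the fit recovers the generating coefficients. With real diary data the estimate of the symptom coefficient would carry sampling error, and it is that estimate, computed per subject, that serves as the linkage score.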

In terms of the linkage between illness and moods, respiratory infections were least 
linked to day-to-day moods. Apparently, when someone catches a cold, it has little impact 
on daily moods. The distress factor was the symptom with the strongest linkage to daily 
mood. This stands to reason given that items comprising this symptom are relevant to 
strong negative emotions, including “urge to cry” and “trouble concentrating.” Finally, 
moderate levels of linkage were found between aches and gastrointestinal symptoms on
one hand, and daily mood on the other. This suggests that these symptoms are frequently 
accompanied by greater than chance levels of negative affect. However, as with all the 
parameters, there was a large range across persons in the linkage between distress and 
daily affect. 

When examining the correlation between personality variables and the linkage 
between mood and symptoms, a consistent pattern emerged with Type A personality. The 
correlations were significant and negative for three out of four symptoms, implying that 
subjects scoring high on the Type A dimension reported lower linkages between negative 
moods and physical symptoms than subjects scoring low on this personality measure. 
These results indicate that Type A individuals tend not to be bothered by symptoms of 
respiratory infections, aches and pains, or even minor distress symptoms. The one excep- 
tion was with gastrointestinal symptoms, which showed the opposite pattern; apparently, 
stomach and digestive system upset is accompanied by strong negative moods for Type 
A individuals. 

This example illustrates the value of using correlational parameters over time within 
persons to assess a daily emotion process, in this case, the linkage over time between 
one’s emotions and one’s ongoing physical health. One more example can quickly high- 
light the utility of within-subject correlations to assess a particular psychological process, 
in this case, emotional reactivity to daily life events. Larsen and Cowan (1988) were inter- 
ested in the link between life events and daily emotions, and whether this linkage differed 
among persons. They had 62 participants write down their most significant event each 
day for 56 consecutive days, and these events were then scored by a team of raters along 
a hedonic dimension, ranging from extremely good to neutral to extremely bad. Subjects 
also reported their emotions each day, and these were scored on a bipolar dimension of 
extremely pleasant to extremely unpleasant. With these data, the authors correlated each 
subject’s daily emotion score with the rating of their most significant life event each day, 
thereby assessing for each subject the linkage between emotions and the events in his or 
her daily life. Averaging these correlations, Larsen and Cowan reported a moderately 
strong correlation between daily affect and life events, implying that, for the most part, 
people’s daily emotions track the hedonic events in their daily lives. However, there was 
a good deal of variability in the strength of this association, with scores ranging from a
strongly positive to a close to zero association. Personality variables were then examined
to explore if personality factors were related to the linkage (or lack thereof) between daily 
emotions and daily life events. The one significant personality factor was depression. Per- 
sons with more depressive tendencies showed less linkage between their daily emotions 
and life events. Looking at this effect more specifically, we found that depressive persons 
showed very little relation between negative affect and life events; they reported feeling 
negative emotions in the absence of negative life events. Of course, feeling bad for no 
good reason is a hallmark of depression, but this example illustrates how precisely this 
effect can be observed when daily emotion ratings are analyzed over time. 
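
When within-subject correlations are averaged across people, one common choice is to average them on the Fisher z scale and transform the result back. The sketch below assumes that approach, which may not be how the original study averaged its correlations, and the five correlations are invented values.

```python
import math

def average_correlation(rs):
    """Average correlations via Fisher's z transform (an assumed choice)."""
    zs = [math.atanh(r) for r in rs]   # r -> z
    return math.tanh(sum(zs) / len(zs))  # mean z -> r

# Hypothetical within-subject event-emotion correlations for five people.
linkages = [0.72, 0.55, 0.08, 0.61, 0.30]
mean_linkage = average_correlation(linkages)
spread = max(linkages) - min(linkages)  # range across persons
```

The average summarizes the typical linkage, while the spread is a reminder that the individual scores, not the average, carry the individual difference information.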


Rhythmic Patterns in Daily Emotion 


Not only are there individual differences in duration (and predictors of duration) of emo- 
tion and emotion-related experiences, but individual differences also exist in the unique 
patterns that emotions display over time. The examination of rhythmic patterns of emo- 
tion over time can be achieved with the application of a novel analytic approach (at least 
in psychology) called spectral analysis. In spectral analysis (Larsen, 1987), a spectrum of 
sine-cosine waves is fit to each subject’s data over time (accomplished through the fast 
Fourier transform; Gottman, 1981). The technique then determines which particular 
sine-cosine waves, all varying in period (length of one complete cycle), amplitude (height 
of the waveform), and phase (the value at the start of observation), account for variance in 
the data. This tells us the degree to which each subject’s emotions are following a rhyth- 
mic pattern, and whether that pattern is characterized by faster or slower rhythms. For 
some subjects, very fast (frequent changes in emotions) waves may fit better. For others, 
very slow (few changes in emotions) waves may provide a better fit. Following calcula- 
tion of pattern fit, trait personality can be used to determine whether any personality 
characteristics predict the frequency of each subject’s best-fitting wave function. Larsen 
(1987) found that people with high affect intensity, or the tendency to experience extreme 
emotions, showed faster (more frequent changes) wave fits. 
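
The idea of finding the best-fitting wave can be sketched with a plain discrete Fourier transform. The code below is not the fast Fourier transform routine used in the cited work; it computes the same quantities directly, which is slower but adequate for short diary series. The two mood series are invented, and `periodogram` is a hypothetical helper name.

```python
import cmath
import math

def periodogram(x):
    """Share of variance at each Fourier frequency k = 1 .. n // 2,
    where k is the number of complete cycles across the whole series."""
    n = len(x)
    m = sum(x) / n
    centered = [v - m for v in x]
    total = sum(v * v for v in centered)
    shares = {}
    for k in range(1, n // 2 + 1):
        coef = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                   for t, v in enumerate(centered))
        # Every frequency except the highest one (even n) counts twice.
        weight = 1 if (n % 2 == 0 and k == n // 2) else 2
        shares[k] = weight * abs(coef) ** 2 / (n * total)
    return shares

# Hypothetical 56-day series: one slow (28-day) and one fast (4-day) rhythm.
slow_mood = [math.sin(2 * math.pi * t / 28) for t in range(56)]
fast_mood = [math.sin(2 * math.pi * t / 4) for t in range(56)]

slow_k = max(periodogram(slow_mood), key=periodogram(slow_mood).get)
fast_k = max(periodogram(fast_mood), key=periodogram(fast_mood).get)
```

The dominant frequency for each subject (here, cycles per 56 days) is the within-subject parameter; dividing the series length by it gives the period in days.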

Sine-cosine waves can be used to describe specific patterns of emotion fluctuation 
over time. In addition to being one way to describe rhythms in emotional life, these emo- 
tional patterns may fit our social and/or biological rhythms. A few biological processes are 
known to exhibit circaseptan (7-day or weekly) rhythms (e.g., immune system activity).
Emotions might also track a weekly calendar. To examine this question, Larsen and Kasi- 
matis (1990) had research participants complete 84 consecutive daily emotion reports (3 
months’ worth of data, scored as daily hedonic balance) and subjected each participant’s 
data to spectral analysis to identify variance accounted for by a 7-day rhythm. Results 
indicated that at an aggregate level, a 7-day oscillator accounted for a very large portion 
of the variance over 84 days, leading to the conclusion that, in the aggregate, hedonic 
tone exhibited a strong weekly rhythm in this population (college undergraduates). 

To further explore circaseptan rhythms in hedonic tone, the process approach was
applied to the data. Larsen and Kasimatis (1990) calculated the degree to which each
individual’s data could be fit to a circaseptan pattern (i.e., a 7-day sine wave) in terms
of variance accounted for in each individual. This was conceptualized as the degree to
which each subject was entrained to a weekly cycle in terms of daily hedonic balance.
Next, the degree of entrainment to this circaseptan pattern was examined in terms of
whether it could be predicted from relevant personality variables. The trait of extraversion
was selected as a predictor, because extraverts are known to avoid routine and
prefer to be more spontaneous in daily life (e.g., they don’t wait till the weekend to have 
a party). Larsen and Kasimatis predicted that introverts would show the most entrain- 
ment to a weekly cycle. Because persons who desire arousing experiences (those high in 
extraversion and sensation seeking) constantly seek them out and show higher levels of 
positive emotions when they engage in arousing behaviors, extraverts should show less 
entrainment to the circaseptan pattern. Indeed, results indicated that those higher in 
extraversion and sensation seeking showed significantly lower entrainment to a circasep- 
tan pattern of daily affect. 
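
An entrainment score of this kind can be sketched as the proportion of a subject's variance accounted for by a 7-day sine-cosine pair. The version below assumes the series covers a whole number of weeks, which lets the fit be computed by simple projection; the function name and the two example series are hypothetical.

```python
import math

def weekly_entrainment(x, period=7):
    """Proportion of variance explained by a sine-cosine wave of the
    given period. Assumes the series spans a whole number of periods."""
    n = len(x)
    if n % period:
        raise ValueError("series length must be a whole number of periods")
    m = sum(x) / n
    ss_total = sum((v - m) ** 2 for v in x)
    a = 2 / n * sum((v - m) * math.sin(2 * math.pi * t / period)
                    for t, v in enumerate(x))
    b = 2 / n * sum((v - m) * math.cos(2 * math.pi * t / period)
                    for t, v in enumerate(x))
    return (n / 2) * (a * a + b * b) / ss_total

# Hypothetical 84-day series of daily hedonic balance.
entrained = [math.cos(2 * math.pi * t / 7) for t in range(84)]
mixed = [math.cos(2 * math.pi * t / 7) + math.cos(2 * math.pi * t / 4)
         for t in range(84)]

r2_entrained = weekly_entrainment(entrained)
r2_mixed = weekly_entrainment(mixed)
```

A subject whose mood follows only the weekly wave scores near 1, whereas a subject whose mood is equally driven by an unrelated 4-day rhythm scores near .5. These per-subject scores are what would then be correlated with extraversion or sensation seeking.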

These findings have also been replicated with aural observations using the Electroni- 
cally Activated Recorder (EAR; Mehl & Robbins, Chapter 10, this volume). For example, 
in two samples of recordings, Hasler, Mehl, Bootzin, and Vazire (2008) coded sounds 
associated with positive affect (i.e., laughing, singing, social interaction) and negative 
affect (i.e., arguing, sighing). The occurrence of these sounds was then examined with 
regard to circadian patterns in daily behavior. Indeed, behaviors associated with positive 
affect did lie on a 24-hour sine wave. Moreover, and consistent with the findings of Larsen
and Kasimatis (1990) concerning the role of extraversion in pattern fit, extraversion pre- 
dicted higher amplitudes for the curves associated with socialization behavior. Although 
the behaviors associated with negative affect did not show a good fit to the sine-cosine 
pattern, this may be due to the relatively internal nature of the experience of negative 
affect. 

Thus, changes in emotions over time can be represented by a sine-cosine wave that 
generally fits a weekly pattern. These patterns can be observed in both self-report and 
behavioral observations. In addition, the degree to which an individual exhibits this 
specific pattern can be predicted by personality. Emotions and emotion-relevant behaviors
may exhibit other specific patterns over known biological rhythms (e.g., 90-minute,
24-hour, 28-day). Investigations of other patterns of change in emotion can be achieved 
by yet another novel approach to analyzing intensive time sampling data. 


The Complexity of Emotional Experience 


As described earlier, intensive time sampling allows for the examination of specific pat- 
terns of change in emotions over time. In addition to exhibiting temporal patterns of 
change, emotional experience is complex, and this complexity can be lost if one fails to 
include time as a facet of the data. Recent research using intensive time sampling has 
revealed a number of findings that indicate emotional experience is highly varied and 
complex within the individual. For instance, what is the nature of within-person correla- 
tions between emotional states? How can one experimentally describe extreme shifts in 
emotional states? What are the consequences of extreme shifts in emotion? Again, sampling
emotion repeatedly over time is the only way to address these kinds of issues.

Research regarding the structure of emotional experience has largely focused on
between-person variation in emotional states. Findings in this area have concluded that 
although arousal plays a key role in affective structure (Feldman, 1995; Larsen & Diener, 
1985, 1992; Russell, 1980), emotional experience can be broadly represented by two 
independent components representing positive and negative affect (Watson, Clark, & 
Tellegen, 1988). While this clean and simple delineation of positive and negative affect 
may represent between-person variation, it is not clear whether this structure represents 
within-person affective experience. Using intensive time sampling, both Zelenski and
Larsen (2000) and Vansteelandt, Van Mechelen, and Nezlek (2005) have investigated the
nature of between-person versus within-person affect structure (see also Zautra, Berk- 
hof, & Nicolson, 2002). Using twice-daily 6-hour retrospective (i.e., “How have you 
felt in the last 6 hours?”) samplings of emotional reports (for 1 month), Zelenski and 
Larsen (2000) found that although between-subject correlations between emotions of 
the same valence (i.e., happy and excited) were high, within-subject correlations were 
rather low. This suggests that emotions of the same valence, while highly related at the 
between-person level, may function in a discrete manner within-person. They also found 
that, somewhat consistent with known between-subjects structure, relationships between 
positive and negative emotions were somewhat small. Vansteelandt and colleagues used 
a different sampling approach (nine reports a day for 2 weeks), choosing to sample par- 
ticipants’ immediate affective states (i.e., “How do you feel right now?”) rather than a 
retrospective report of some time period. While these results confirmed larger between- 
person than within-person relationships between negative emotional states, there were 
few differences for positive emotional states. In addition, at the within-person level, large 
and negative relationships were observed between emotions of opposite valence. The find- 
ings of these studies highlight two important aspects of emotional experience. First, the 
time over which people report their emotions can have dramatic effects on any observed 
results. While this conclusion has obvious consequences for research methodology, it also 
highlights the highly temporal nature of emotional experience. Second, emotional experi- 
ence may be more complex at the within-person level. While between-subjects research 
may yield two broad and orthogonal factors representing positive and negative affect, 
within-person structure is not so clearly and simply defined. This level of complexity at 
the within-person level can also be observed when examining specific types of changes 
in emotions. 


Affective Pulse and Spin 


Common sense might suggest that changing from feeling very happy to slightly happy is a 
different experience than changing from feeling slightly happy to feeling moderately neg- 
ative. In laboratory studies using emotion inductions, researchers are typically interested 
in quantitative changes in emotion states (i.e., a standard deviation shift toward more 
negative emotions). Unfortunately, using typical change measurements, such as points on 
Likert-type scales, two change scores may be identical in absolute value yet quite different 
subjectively if one wholly ignores the qualitative differences in the two types of change. 
Further adding to the problem, it would be impossible to implement an emotion induc- 
tion that caused the same qualitative change in all subjects. Thus, we are left with only 
one option: to examine these types of changes as they occur in the real world. Again, 
intensive time sampling helped to solve this dilemma. 

Kuppens, Van Mechelen, Nezlek, Dossche, and Timmermans (2007) examined these 
qualitatively different changes in emotional experience by having subjects report their 
emotional state nine times a day for 1 week. During each reporting period, subjects 
marked a graph to indicate their current emotional state. This graph had an axis representing
pleasantness and an axis representing arousal and, thus, represented all possible experiences
within the affect circumplex (see Larsen & Diener, 1985, 1992; Russell, 1980).
Using this graphical representation of affect, Kuppens and his coauthors (2007) introduced
two new ways to conceptualize changes in emotional states that move beyond typical
measures, such as mean and standard deviation: pulse and spin (see also Moskowitz
& Zuroff, 2004). Pulse represents the frequency of change in the intensity of emotional
experience. Someone with a high pulse score would experience frequent changes between 
neutral and extreme emotional states. Spin quantitatively represents the degree of quali- 
tative shifts in emotional experience, such as a change from experiencing activated pleas- 
ant emotions to experiencing unactivated unpleasant emotions. Individuals with a higher 
spin score would experience more qualitative changes in their emotional states. 
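
One way to make these two indices concrete is to treat each report as a vector in the pleasantness-arousal plane, with pulse as the within-person standard deviation of vector length and spin as the circular standard deviation of vector angle. That operationalization is an assumption of this sketch; the chapter describes the indices only conceptually, and the reports below are invented.

```python
import math

def pulse(points):
    """Standard deviation of vector length (variability in intensity)."""
    lengths = [math.hypot(p, a) for p, a in points]
    m = sum(lengths) / len(lengths)
    return math.sqrt(sum((v - m) ** 2 for v in lengths) / (len(lengths) - 1))

def spin(points):
    """Circular standard deviation of vector angle (variability in quality).
    Reports at the exact origin have no defined angle and are skipped."""
    angles = [math.atan2(a, p) for p, a in points if (p, a) != (0, 0)]
    c = sum(math.cos(t) for t in angles) / len(angles)
    s = sum(math.sin(t) for t in angles) / len(angles)
    r = min(1.0, math.hypot(c, s))  # mean resultant length, capped at 1
    return math.inf if r == 0 else math.sqrt(-2 * math.log(r))

# Hypothetical (pleasantness, arousal) reports for two subjects.
same_quality = [(1, 1), (3, 3), (0.5, 0.5), (4, 4)]   # intensity varies
shifting = [(2, 2), (-2, -2), (2, -2), (-2, 2)]       # quadrant varies
```

The first subject always feels activated and pleasant but to varying degrees, giving high pulse and essentially no spin. The second moves between quadrants at constant intensity, giving the opposite profile.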

While extraversion and neuroticism were not related to more traditional measures 
of emotion variability (within-person standard deviations for activation or pleasant- 
ness), they were highly related to spin (r = .36 and -.37 for neuroticism and extraversion, 
respectively). Thus, these personality variables are indicative of a tendency toward large 
qualitative changes in affect but not raw variance itself. In addition, spin correlated with 
indicators of poor adjustment. Those who experience more frequent qualitative shifts 
scored higher in depression and lower in self-esteem; high spin may represent a risk factor 
for psychological maladjustment. 


Affective Instability over Time 


The relationships between qualitative shifts in emotional experience and measures of 
adjustment, depression, and self-esteem observed by Kuppens and colleagues (2007) seem 
to provide evidence for a process that has been implicated in a number of psychologi- 
cal disorders: affect instability. Affect instability is characterized by heightened affect 
reactivity (e.g., Larsen & Ketelaar, 1989), and extreme and unpredictable changes in
emotional experience (Trull et al., 2008). Given the complexity and temporal nature of 
the construct, intensive time sampling is again required to investigate the nature and 
consequences of this facet of emotional experience. This complexity also suggests that 
traditional measures of trait negative affect, such as neuroticism, may not fully capture 
affect instability. 

Miller, Vachon, and Lynam (2009) investigated this possibility by having participants 
rate their negative affect eight times a day for 1 week. To take into account the ampli- 
tude, frequency, and temporal dependence of changes in negative affect, affect instability 
was calculated using the mean square successive difference. As expected, while affect 
instability was related to average negative affect, it was not related to global neuroticism. 
However, affect instability was positively related to the hostility and impulsivity subfac- 
ets of neuroticism. Affect instability was also negatively related to agreeableness, which 
is consistent with research linking high levels of neuroticism and low levels of agreeable- 
ness to psychopathology (Saulsman & Page, 2004). Also consistent with links between 
affect instability and psychopathology, affect instability was highly related to measures 
of depression, anxiety, and anger (Miller et al., 2009). Thus, qualitative changes in emo- 
tional states exist and have important links to daily functioning and psychopathology. 
This type of temporal variation in affect can be examined only when we consider how 
emotional processes unfold over time. 
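
The mean square successive difference mentioned above is simple to compute, and a small example shows why it captures something the variance does not. The two series below are invented and contain exactly the same values in a different order.

```python
from statistics import pvariance

def mssd(x):
    """Mean of the squared differences between consecutive reports."""
    diffs = [(b - a) ** 2 for a, b in zip(x, x[1:])]
    return sum(diffs) / len(diffs)

# Hypothetical negative affect ratings: same values, different ordering.
drifting = [1, 1, 2, 2, 3, 3, 4, 4]   # changes gradually
jumpy = [1, 4, 2, 3, 1, 4, 2, 3]      # changes abruptly

same_variance = pvariance(drifting) == pvariance(jumpy)
more_unstable = mssd(jumpy) > mssd(drifting)
```

Both series have identical means and variances, yet the second has a far larger value on this index because it depends on the temporal order of the reports.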


Daily Data Combined with Other Sources of Data 


While daily emotion data are most often analyzed on their own, creative uses can be 
found when they are combined with other sources of data. As an example of this combined
strategy, Augustine, Mehl, and Larsen (in press) used EAR data to investigate the
frequency of use of emotion words in daily speech. We had previously observed in lexical
databases (generated from analyses of written samples) that there is a correlation between 
the frequency with which words are used and how positively those words are rated. The 
question is, does this relationship exist in actual language use? 

While there are more positive emotion descriptors in language, negative emotion 
descriptors are more differentiated (this effect can be observed in 20 languages; Rozin, 
Berman, & Royzman, 2010). Thus, based on the number of positive emotion descrip- 
tors, one might expect a positivity effect in language use. However, we investigated the 
general positivity of words used, not the use of positive emotion words. In other words, 
do people use words such as puppy or cake more often than they use words such as murder
or cockroach? Consistent with findings from text-generated lexical databases, we
did find that there is a linguistic positivity effect in everyday speech; people tend to use 
positive words more often than they use negative words in daily life. This finding was 
obtained by taking frequency data from daily speech samples, determining frequency 
counts of a large number of emotion words from these naturalistic samples of linguistic 
behavior, merging these with normative valence ratings of those words, and correlating 
the frequency counts with the valence ratings across words for each subject. The average 
of these within-person correlations showed a definite bias toward more frequent use of 
positive words than negative words in daily life. Moreover, we found that certain per- 
son characteristics moderate this relationship. Women showed a stronger positivity bias 
than men. Moreover, the personality traits of extraversion and agreeableness correlated 
significantly with a stronger linguistic positivity bias. Thus, we again see a marker of 
the complexity of emotional experience. While individuals are generally more positive in 
their speech and writing patterns, the degree to which they show this positivity effect is 
moderated by gender and personality. 
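
The merging and correlating steps can be sketched for one speaker as follows. The valence ratings and the transcript are invented, rated words that the speaker never used are counted as zero (a choice made for this sketch, not necessarily the one made in the study), and the helper names are hypothetical.

```python
import math
from collections import Counter

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def positivity_bias(words, norms):
    """Correlate how often each rated word is used with its valence."""
    counts = Counter(w for w in words if w in norms)
    rated = sorted(norms)
    return pearson([counts[w] for w in rated], [norms[w] for w in rated])

# Hypothetical valence norms (1 = very negative, 9 = very positive).
norms = {"puppy": 8.0, "cake": 7.5, "friend": 7.8,
         "murder": 1.5, "cockroach": 2.0, "tax": 3.0}

# Hypothetical transcript for one speaker; unrated words are ignored.
transcript = "the puppy ate cake and the puppy saw a friend".split()
transcript += "cake for the puppy not murder says my friend tax".split()

bias = positivity_bias(transcript, norms)
```

A positive value indicates that this speaker uses positively rated words more often than negatively rated ones. Computing the value for each speaker gives the within-person scores that were averaged and then related to gender and personality.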


Prospects for the Future of Daily Studies of Emotion 


Our intention here has been to inspire enthusiasm for the unique questions about emotion 
that can be addressed using daily, or intensive time sampling, data. When time is included 
in data capture by intensively sampling emotional experience over many occasions, then 
time becomes an inherent structural component of the data. As such, it can be analyzed 
in unique ways that reveal specific emotional processes. In particular, ideal questions for 
study with these methods revolve around emotional processes that involve patterns of 
change (e.g., cycles or rhythms) or duration or rate of change, or covariation of change, 
either at the same time (co-occurrence) or across time (lagged covariation; e.g., in incu- 
bation periods), or the complexity of covariation. In fact, only intensive time sampling 
methods can adequately address such questions about emotional processes, because pro- 
cess, by its very definition, involves change over time. 

We hope our enthusiasm is contagious and that other researchers can see how study- 
ing a phenomenon of interest in a daily sampling format can open whole new avenues 
of investigation. Given the relative youth of this type of research, all of the findings 
presented here represent only a beginning. For each of the processes described, there are 
nearly limitless opportunities for discovering what types of person- and situation-level 
variables might moderate the observed findings. Regarding duration, for example, how 
does implementing an affect regulation attempt alter the duration of emotions, and does
this potential alteration differ based on the strategy used (e.g., Augustine & Hemenover,
2009) or the specific emotion targeted? For the prediction of occurrence, what specific 
types of life events have the strongest impact on within-person relationships between sit- 
uation and affect? To what extent would life events alter the rhythmic patterns observed 
for affective experience? Concerning the complexity of affective experience, do any per- 
sonality variables (neuroticism, depression, etc.) predict the degree to which one’s within- 
person affect structure matches known between-person affect structure? Are there other 
ways to measure within-person variance in emotion that provide a more realistic and 
fine-grained view than a simple standard deviation? Finally, are there other behavioral 
measures, in addition to language use, that we can use to get a clearer picture of emotion 
in the real world (as in Goodwin, Chapter 14, this volume)? 

Although these questions concern existing research on temporal processes in emo- 
tion, research questions that have traditionally been answered with cross-sectional meth- 
ods can also be addressed with intensive time sampling and the process approach. A 
number of affectively relevant research topics would benefit from an examination using 
intensive time sampling, such as affective forecasting, affective influences on decision 
processes, or the influence of cognitive abilities on affective experience. However, the 
real power of intensive time sampling data collection is gained when these methods are 
used to leverage the temporal dimension into the research questions under investigation. 
This allows the rigorous investigation of various unique temporal processes. Given their 
variable nature, the inclusion of temporal processes as an aspect of the data is required to 
fully understand emotion and personality. Intensive time sampling, in which investigators 
can pinpoint the size and shape of changes in various phenomena, is increasingly being 
used by researchers (Khoo, West, Wu, & Kwok, 2006). Standards are being proposed 
for the application, analysis, and reporting of data so gathered (Stone & Shiffman, 2007; 
Stone, Shiffman, Schwartz, Hufford, & Broderick, 2002). As is attested to by the number 
of researchers now successfully adopting these methods, the technology for efficient and 
creative data gathering and analysis is growing by leaps and bounds. 

Given these developments, the future of intensive time sampling and the process 
approach is limited only by the creativity and ingenuity of researchers. Any researcher 
who is seriously interested in studying processes over time, especially catching emotion 
and personality in the act as temporal processes, would benefit from adopting this method 
of investigation. 
CHAPTER 28 


Close Relationships 


SHELLY L. GABLE 
COURTNEY L. GOSNELL 
THERY PROK 


The Ubiquity of Close Relationships 


Across the lifespan, our health, well-being, and development are closely tied to the qual- 
ity and course of close relationships (see Reis, Collins, & Berscheid, 2000, for review). 
For example, early attachment relationships play a critical role in infants’ and children’s 
development (see Bowlby, 1969; Rutter, 1979), and for older adults, myriad relationships, 
such as those with children, friends, family, and romantic partners, have an impact on 
well-being (for a review, see Blieszner, 2006). Past work shows that high-quality, sup- 
portive relationships are linked to healthier immune functioning, lower rates of mortal- 
ity, and reduced onset of diseases (e.g., Cohen, 1988; House, Landis, & Umberson, 1988; 
Kiecolt-Glaser, Gouin, & Hantsoo, 2010). Research on happiness has also suggested that 
having good social relationships may be a necessity for general happiness and satisfaction 
with life (Diener & Seligman, 2002). 

A lack of social ties, the disruption of close relationships, or the presence of conflict- 
ual or hostile relationships has also been linked to decrements in health and well-being. 
For instance, loneliness has been associated with poor sleep, lower medical treatment 
adherence, and poorer health overall (Cacioppo & Patrick, 2008; Segrin & Passalacqua, 
2010). Losing an important close relationship, such as through the death of a spouse, has 
been linked to decreases in physical health and healthy behaviors as well as increases in 
depression and anxiety disorders (Eng, Kawachi, Fitzmaurice, & Rimm, 2005; Onrust 
& Cuijpers, 2006). 

Problems or conflicts in existing close relationships can take a physical toll on an 
individual’s health and well-being. Marital discord has been associated with reduced 
recovery and immune function (Kiecolt-Glaser et al., 2005), and marital functioning 
has been found to be a strong predictor of survival of a medical emergency (Coyne et 
al., 2001). The importance of close relationships has not gone unnoticed in the theo- 
retical literature; many theorists have argued that the need to connect with and main- 
tain close relationships is universal and fundamental (Baumeister & Leary, 1995; Deci 
& Ryan, 1985). It is likely that relationships took on such psychological prominence 
because of their importance for survival, such as through protection, shared resources, 
and increased chances of reproduction over our evolutionary history (e.g., Barkow, Cos- 
mides, & Tooby, 1992). 

Close relationships are so integral to health and well-being that they are part of the 
fabric of our everyday lives. Close relationships are built on the repeated interactions we 
have with partners over time, and in fact, individuals spend a good amount of time each 
day in the company of close others. A study by Milardo, Johnson, and Huston (1983) 
found that participants reported an average of four to five voluntary interactions per day 
(spending about 6 hours of their day) with individuals in their social networks. In addi- 
tion to actual interactions with others, people also engage in a wide variety of thoughts, 
feelings, and behaviors surrounding their close relationships every day. For example, they 
may think about the meaning of a friend’s ambiguous comment or daydream about an 
upcoming family reunion, they may avoid a spouse after a heated argument or buy flow- 
ers at the local shop to make up, or they may feel excited about an upcoming first date or 
sad at the loss of a beloved great aunt. In short, close relationships are a ubiquitous and 
important component of individuals’ everyday lives. 


Between-Person and 
Within-Person Approaches to Studying Relationships 


One strategy that researchers interested in close relationships can take is a between- 
person approach; that is, they can examine individuals possessing certain characteristics 
or situated in particular circumstances in an effort to understand relationship processes. 
In this design, the investigator examines how people who differ along some theoretically 
defined dimension behave on variables of interest, or how people in general respond to 
situational variations. Thus, researchers interested in studying compassion might relate 
levels of reported compassion in friendships to dispositional variables such as extraver- 
sion or sex. Similarly, researchers interested in the effects of rejection on cortisol produc- 
tion might design experiments in which people are randomly assigned to either a rejection 
condition or a control condition. Clearly, the between-person approach has provided 
much valuable information about close relationship processes and their outcomes. How- 
ever, we know that at any given time people have close relationships with different part- 
ners in their social network, they interact with the same partners in different contexts 
and roles, and their relationships fluctuate and evolve over time. In fact, the very defini- 
tion of a close relationship, according to many theorists, is one in which the partners have 
mutual influence across different contexts over time (e.g., Kelley & Thibaut, 1978). Thus, 
examining close relationships in the variety of contexts in which they unfold over both 
short and long time intervals, and with myriad partners, is crucial. Indeed, many aspects 
of close relationships (e.g., length of involvement, conflict, dissolution) are difficult or 
unethical to manipulate or simulate in traditional laboratory experiments; thus, studying 
these variables in situ is the best—and sometimes the only—option. 

In this chapter, we contend that methodologies developed for daily experience stud- 
ies are critical to enhancing our understanding of close relationships because they inher- 
ently take a within-person approach (see Hamaker, Chapter 3, this volume). The within- 
person approach is ideal for examining variations in relationship phenomena manifested 
across different relationships, contexts, and time. For example, a researcher interested in 
social support can examine perceptions of availability of social support from a spouse 
over time during a prolonged medical treatment. Receipt and provision of social support 
can be studied across different partners, such as support from a romantic partner, parent, 
child, or friend. Or the effectiveness of social support provided by a spouse can be stud- 
ied in the context of work-related, financial, and social stressors. Moreover, individual 
differences, such as personality factors, relationship qualities, or general frameworks of 
relationship models (e.g., attachment dimensions) can be examined as moderators of any 
of the within-person variability observed in the preceding examples. 


Relationships Research through a Between-Person Lens 


A great deal of research on close relationships is conducted from a between-person per- 
spective: Data are collected at one time point (e.g., level of empathy), in a particular 
context (e.g., empathy after a disagreement), or in a particular relationship (e.g., empathy 
for a best friend). One typical between-person design compares people who possess a par- 
ticular characteristic to those who do not have that characteristic, or to those who have 
different levels of it. For example, in an interesting study, Fincham, Paleari, and Regalia 
(2002) found that people who felt greater empathy for their spouse overall were more 
likely to forgive their spouse for transgressions than those with less empathy, and this 
association was stronger for men than for women in this sample. In another typical design, people 
are placed in a certain situation, and processes of interest are examined. For example, an 
important study by Feeney and Collins (2001) placed one member of a romantic pair in 
a stressful laboratory situation and examined the support behaviors offered by the other 
person. They found that individual differences in attachment beliefs of the support pro- 
vider predicted patterns of support offered under stress. 

The between-person approach has yielded valuable information in close relation- 
ships. However, it is our contention that the very nature of close relationships should 
prompt us to consider more readily a within-person perspective. Moreover, we maintain 
that daily experience methodologies are uniquely suited to examine relationship processes 
from a within-person perspective. It is this perspective that will allow us to understand 
more conclusively relationships across time, context, and partners. Ultimately, combining 
between- and within-person approaches will yield a clearer picture of the role that rela- 
tionships play in myriad psychological processes, as well as in health and well-being. 


Relationships Research through a Within-Person Lens 


Research on close relationships is particularly well suited to a within-person approach. 
This way of conceptualizing research questions is ideal for examining processes of inter- 
est as they manifest across different relationships, contexts, and time. The daily experi- 
ence methods that are the subject of this book are ideal tools for answering within-person 
questions because they take advantage of naturally occurring relationship processes out- 
side of the laboratory (see Reis, Chapter 1, this volume). In particular, daily experience 
methods require that participants rate the variable(s) of interest on multiple occasions, 
within multiple relationship partnerships, or within different contexts. Of course, from 
a statistical standpoint, multiple data points allow for greater sensitivity in measurement 
of a given process or phenomenon. Beyond this, however, multiple data points allow close 
relationship scientists to examine consistency and variation of the sampling units, which 
has important theoretical implications. Researchers who take a within-person perspec- 
tive are forced to think dynamically about relationship processes and measure them as 
such. Their theories are stretched and challenged to address consistency and variation 
across time, contexts, and partners. The within-person approach is an important concep- 
tual decision, not just a methodological tool or statistical tactic (see Hamaker, Chapter 
3, this volume). 
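
One way to make the distinction between consistency and variation concrete is to split each repeated rating into a person mean (the between-person part) and a deviation from that mean (the within-person part). The following is a minimal Python sketch under stated assumptions: the ratings, the participant labels, and the function name `split_between_within` are hypothetical and are not taken from any study cited in this chapter.

```python
from statistics import mean


def split_between_within(ratings_by_person):
    """Return {person: (person_mean, [deviation per occasion])}.

    ratings_by_person maps a participant label to that participant's
    repeated ratings (hypothetical data, one value per occasion).
    """
    result = {}
    for person, ratings in ratings_by_person.items():
        person_mean = mean(ratings)                      # between-person component
        deviations = [r - person_mean for r in ratings]  # within-person component
        result[person] = (person_mean, deviations)
    return result


# Hypothetical daily support ratings on a 1-7 scale for two participants.
diary = {"A": [5, 6, 4, 5], "B": [2, 3, 1, 2]}
parts = split_between_within(diary)
```

In this illustration the two participants differ in their means but show the same pattern of day-to-day deviations, which a single score per person would not reveal.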


Framing Close Relationships Research Questions 
and Developing Theories 


When constructing research questions, attention to the between-person and within- 
person distinction yields important empirical and theoretical issues. Consider the follow- 
ing example: When we study social support in relationships, we might try to understand 
how relationship satisfaction differs from couple A to couple B as a function of their 
level of social support. However, if our empirical investigation and conceptualization 
are limited to this between-person strategy, we capture only the level of satisfaction 
or social support and overlook the effects of time, context, and partner factors. More 
concretely, when we ask someone, “How supportive is your partner?,” the answer may be 
based on comparisons of his relationship to those of other couples (between-person), 
how much social support his wife provides now compared to when they first met (time), 
how supportive she is today compared to a particularly stressful period in their rela- 
tionship (context), or how supportive she is compared to how supportive his mother is 
(partner). Thus, by ignoring the within-person perspective we are assuming a priori that 
there is consistency in the phenomenon of interest, or at the very least implying that any 
inconsistency is meaningless measurement variance. In addition, we overlook possible 
interactions of between-person and within-person effects that could shed light on 
complex moderating processes. 

Close relationship theories and research questions should be constructed to cap- 
ture the dynamics of time, context, and partner effects. Moreover, theories and research 
questions that have the greatest potential for understanding the complexities of relation- 
ships simultaneously incorporate a between- and within-person perspective. To illustrate 
this, we reconsider the earlier social support question in terms of person, time, context, 
and partner. As depicted in the hypothetical data in Figure 28.1, the researcher can ask 
whether perceptions of social support from a spouse increase or decrease across days in 
a 2-week period during a stressful event. Moreover, individual differences in the person 
(e.g., gender), relationship (e.g., newlywed or long-married), or environmental context 
(availability of other network members) can be explicitly modeled as moderators and 
predictors of these time changes. Thus, although they start out at similar levels, A in 
Figure 28.1, who has high self-esteem, shows a linear increase over time in perceptions of 
support, and B in Figure 28.1, who has low self-esteem, shows a linear decrease over time 
in perceptions of support. Unpacking this effect and understanding the process of change 
over time would be not only more accurate in terms of representing the pattern of data 
(not merely a mean difference) but also richer in terms of our theoretical knowledge 
of social support than data taken from a strictly between-person approach. 
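
The contrast between a mean difference and a difference in change over time can be sketched in a few lines of Python. This is an illustration only: the ratings are invented (they are not the values plotted in Figure 28.1), the function name `ols_slope` is hypothetical, and a per-person least-squares slope stands in for the multilevel growth models that would normally be used with such data.

```python
def ols_slope(days, scores):
    """Least-squares slope of scores regressed on days for one person."""
    n = len(days)
    day_mean = sum(days) / n
    score_mean = sum(scores) / n
    sxy = sum((d - day_mean) * (s - score_mean) for d, s in zip(days, scores))
    sxx = sum((d - day_mean) ** 2 for d in days)
    return sxy / sxx


days = list(range(1, 15))                         # 14 daily reports
person_a = [4.0 + 0.2 * (d - 1) for d in days]    # hypothetical: support rises
person_b = [4.0 - 0.2 * (d - 1) for d in days]    # hypothetical: support falls

slope_a = ols_slope(days, person_a)
slope_b = ols_slope(days, person_b)

# Averaging the two people on each day yields a flat line that hides both trends.
sample_mean_by_day = [(a + b) / 2 for a, b in zip(person_a, person_b)]
```

With these invented values the two slopes have opposite signs while the daily sample mean does not change, so an analysis of means alone would suggest that nothing happened over the 2 weeks.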


FIGURE 28.1. Graphic representation of two hypothetical participants in an interval-contingent 
daily experience study of social support from their spouse over the course of 2 weeks. Social 
support across time. [Plot not reproduced: level of social support (y-axis) by day, 1 to 14 
(x-axis), showing Person A’s mean, Person B’s mean, and the sample mean each day.] 


Similarly, Figure 28.2 shows hypothetical perceptions of support during interactions 
with different partners in one’s social network. As is illustrated clearly, there are mean 
differences between A and B overall; however, the unique relationship each has with 
members of the network accounts for a large portion of the variance as well. In terms 
of theory development, unpacking these data requires the explicit adoption of a dyadic 
perspective—examining what factors of the relationship partner or the interaction of the 
participant and the relationship partner account for this variability. Finally, Figure 28.3 
depicts the level of social support from a friend when the nature of the stressor (context) 
varies. As can be seen in this illustration, considerable variation may be attributable to 
the context of the support interaction. Understanding the factors that account for these 
changes (e.g., if the context presents a threat to the support provider) would provide 
unique insight into the nature of support provision and receipt. In short, relationship 
theories can greatly benefit from a within-person perspective. 


FIGURE 28.2. Graphic representation of two hypothetical participants in an interval-contingent 
study of social support received from different network partners. Social support across different 
partners. [Plot not reproduced: level of social support (y-axis) by relationship partner (x-axis), 
showing Person A’s mean, Person B’s mean, and the sample mean for each relationship.] 


FIGURE 28.3. Graphic representation of two hypothetical participants in an observational study 
of perceptions of social support from their best friend in an event-contingent daily experience 
study of different stressors: work, child/parenting, financial, family member, social. Social 
support from best friend across different contexts. 


The Utility of Daily Experience Methods 
in Relationships Research 


Although daily experience methods are a prototypical example of within-person 
approaches, they are by no means the only within-person design. Reis and Gable 
(2000) outlined three general types of within-person designs: exemplary experience, 
reconstructed experience, and ongoing experience. The first, exemplary experience, 
refers to studies conducted in specialized settings and contexts, such as laboratories, liv- 
ing rooms, and therapy offices. A researcher employing such a design, for example, can 
determine the impact of different contexts (e.g., high and low stress) on social support 
and social interaction within a dyad. Or, within a fixed setting, one might examine the 
same individual’s behavior with different partners. For example, one could videotape 
support provision for a negative event to a spouse and then to a child. The second type 
identified by Reis and Gable, reconstructed experience, consists of general, global, or recollected 
accounts of the variable of interest. For example, participants may be asked to summarize 
past experience with a partner over time (e.g., “How much has support changed since 
you married?”), or across contexts (e.g., “How much support do you get when you have 
a problem at work versus a problem with another family member?”), or across partners 
(e.g., global ratings of support from different network members). 

The final category of measurement identified by Reis and Gable (2000), ongoing 
experience, is assessed with the daily experience methods that are the focus of this volume. 
Ongoing experience studies examine thoughts, feelings, and behaviors in their everyday 
naturalistic contexts. They include long-standing methods such as the Experience Sampling 
Method (Csikszentmihalyi & Larson, 1983), Ecological Momentary Assessment 
(Shiffman & Stone, 1998), and the Rochester Interaction Record (Reis & Wheeler, 1991), 
as well as more recent advancements, including the Electronically Activated Recorder 
(Mehl & Robbins, Chapter 10, this volume). Ecological validity is a major benefit of such 
studies and is of great concern in close relationships research. This is because it is nearly 
impossible and oftentimes unethical to create the emotional contexts (“We are breaking 
up!”), behavioral options (storming out of the room), and partner variations (“Should I 
call my brother or my mother about this?”) that make up the rich and powerful foun- 
dation of a close relationship. Nor do other designs allow for the assessment of lagged 
effects (e.g., feelings of resentment the day after conflict). Again, the daily experience 
design is integral to clarifying theoretically important contextual, temporal, or dyadic 
processes. 

Overall, there are three general strategies for sampling ongoing experience (Wheeler 
& Reis, 1993). Interval-contingent studies obtain data at regularly scheduled intervals 
(e.g., once per day, once a week). Signal-contingent methods require subjects to complete 
a report whenever a stimulus prompt is received, usually from electronic personal data 
devices; alternatively, data are collected automatically by a special device (e.g., an ambulatory 
blood pressure monitor) at a signal. Signals may follow a fixed or random schedule, or may be 
randomized within fixed intervals. Event-contingent reports are obtained whenever rel- 
evant events, such as a social interaction or conflict with a partner, have occurred. Each 
of these strategies is ideal for different research questions, as reviewed by Reis and Gable 
(2000; also see Conner & Lehman, Chapter 5, this volume). 
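
The three signal schedules mentioned for signal-contingent sampling (fixed, random, and randomized within fixed intervals) can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the 9:00 to 21:00 sampling window, the six signals per day, and the function names are hypothetical and do not describe any particular device or study.

```python
import random


def fixed_schedule(start_min, end_min, n_signals):
    """Evenly spaced signal times, in minutes after midnight."""
    step = (end_min - start_min) / n_signals
    return [round(start_min + step * (i + 0.5)) for i in range(n_signals)]


def random_schedule(start_min, end_min, n_signals, rng):
    """Signal times drawn at random (without repeats) across the whole window."""
    return sorted(rng.sample(range(start_min, end_min), n_signals))


def stratified_schedule(start_min, end_min, n_signals, rng):
    """One random signal inside each of n_signals equal-width intervals."""
    width = (end_min - start_min) // n_signals
    return [rng.randrange(start_min + i * width, start_min + (i + 1) * width)
            for i in range(n_signals)]


rng = random.Random(0)                  # seeded so the sketch is reproducible
day_start, day_end = 9 * 60, 21 * 60    # hypothetical 9:00 to 21:00 window
fixed = fixed_schedule(day_start, day_end, 6)
free = random_schedule(day_start, day_end, 6, rng)
blocked = stratified_schedule(day_start, day_end, 6, rng)
```

A fully random schedule can leave long stretches of the day unsampled, whereas randomizing within fixed intervals keeps signals unpredictable while guaranteeing one report per block.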


Dyadic and Group Data 


One of the advantages of daily experience methods is that researchers can concurrently 
sample both members of a particular dyad (e.g., both romantic partners) or even sam- 
ple more than two members of a social unit (e.g., family members or sorority sisters). 
This type of design opens up potential hypotheses about cross-partner effects; that is, 
researchers can test hypotheses about how Partner A’s thoughts, feelings, or behaviors 
influence Partner B’s thoughts, feelings, or behaviors. For example, Algoe, Gable, and 
Maisel (2010) showed that one partner’s feelings of gratitude predicted the other part- 
ner’s relationship satisfaction. Time, context, and partners can also be incorporated into 
hypotheses about lagged day effects. 
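
To test a lagged cross-partner hypothesis, each person's report on a given day has to be lined up with the partner's report from an earlier day. The Python sketch below shows one way such a data set might be assembled. It is a sketch under stated assumptions: the field names `gratitude` and `satisfaction`, the participant labels, and the function `lagged_partner_pairs` are hypothetical, and the code only arranges the data; it does not reproduce the analysis of any study cited here.

```python
def lagged_partner_pairs(reports, dyads, lag=1):
    """Pair a person's outcome on day t with the partner's predictor on day t - lag.

    reports: {(person, day): {"gratitude": x, "satisfaction": y}} (hypothetical fields)
    dyads:   list of (person, partner) tuples
    Returns rows of (person, day, partner_gratitude_lagged, own_satisfaction).
    """
    rows = []
    for a, b in dyads:
        for person, partner in ((a, b), (b, a)):   # both directions of influence
            for (who, day), values in sorted(reports.items()):
                if who != person:
                    continue
                earlier = reports.get((partner, day - lag))
                if earlier is None:                # partner has no report for that day
                    continue
                rows.append((person, day, earlier["gratitude"], values["satisfaction"]))
    return rows


# Hypothetical two-day diary for one couple.
reports = {
    ("p1", 1): {"gratitude": 3, "satisfaction": 5},
    ("p1", 2): {"gratitude": 4, "satisfaction": 6},
    ("p2", 1): {"gratitude": 2, "satisfaction": 4},
    ("p2", 2): {"gratitude": 5, "satisfaction": 7},
}
pairs = lagged_partner_pairs(reports, [("p1", "p2")])
```

Days without a matching earlier partner report are dropped, so the first day of a study contributes no lagged rows.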

Another advantage of having more than one member of a social group completing 
records simultaneously is that each member can report on the same event, allowing the 
researcher access to multiple perspectives on the same event. For example, Gable, Reis, 
and Downey (2003) used a strategy based on classic signal detection theory to assess the 
impact of partners’ behaviors on one another during everyday interactions. Specifically, 
they had couples respond to a checklist of daily events (e.g., “My partner criticized me, 
I criticized my partner”; “My partner told me he loved me, I told my partner I loved 
him”). They then examined whether both members agreed that the person enacted or 
did not enact the behavior (hits and correct rejections), or whether one person said the 
behavior occurred but the other disagreed (misses and false alarms). These different types of 
agreements and disagreements showed a different pattern of association with relation- 
ship outcomes. For example, increases in relationship satisfaction were associated with 
positive behaviors only when the participant reported seeing the behavior (hits and false 
alarms), but decreases in satisfaction were observed on days that either the participant or 
the partner reported negative behavior (hits, false alarms, and misses). 
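
The signal detection labels used above can be made explicit with a small Python sketch. It assumes a particular convention that is stated here as an assumption: the report of the person who may have enacted the behavior is treated as the reference, and the report of the partner who may have perceived it is classified against that reference. The function name, the labels as strings, and the week of data are hypothetical.

```python
from collections import Counter


def classify_report(actor_enacted, partner_perceived):
    """Label one behavior-day from the perceiving partner's point of view."""
    if actor_enacted and partner_perceived:
        return "hit"                 # both say the behavior occurred
    if actor_enacted and not partner_perceived:
        return "miss"                # enacted but not perceived
    if not actor_enacted and partner_perceived:
        return "false alarm"         # perceived but not reported as enacted
    return "correct rejection"       # both say the behavior did not occur


# Hypothetical week of paired reports for one behavior: (actor, partner).
week = [(True, True), (True, False), (False, True), (False, False), (True, True)]
tally = Counter(classify_report(a, p) for a, p in week)
```

Under this convention, hits and false alarms are the days on which the perceiving partner reported the behavior, and hits, false alarms, and misses together are the days on which at least one member of the couple reported it.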

One challenge of methods that have more than one member of a social group com- 
pleting daily experience records at the same time lies in ensuring that all participants 
independently complete the records, regardless of whether they are reporting on the same 
events. There are techniques that increase the likelihood of independent observations. 
A popular method is to have each person’s data collection device password-protected. 
Researchers also should provide to participants a clear rationale as to why it is important 
to complete records independently, to not discuss what has been recorded, and to respect 
the confidentiality of the research process. Of course, there are several other methodolog- 
ical challenges associated with daily experience records, such as ensuring participants’ 
willingness and ability to remain engaged with the demands of multiple measurement 
points, and avoiding participant burnout and reactivity to items. Reis and Gable (2000), 
Conner and Lehman (Chapter 5, this volume) and Barta, Tennen, and Litt (Chapter 6, 
this volume) provide overviews of these more general issues. These issues also pertain to 
daily experience studies that have dyads or groups of social partners as participants. 


Data Analysis 


It is beyond the scope of this chapter to discuss data-analytic procedures for daily experi- 
ence studies; this topic is covered in detail in other chapters in this volume. However, it 
is necessary to emphasize that when both members of a dyad or multiple members of a 
social unit complete daily experience studies, additional steps must be taken to account 
properly for this type of nonindependence of data (see Laurenceau & Bolger, Chapter 22, 
this volume). That is, in studies in which only one person from a social unit participates, 
each person still contributes multiple data points, and each data point shares in common 
some variance because all data points are nested within the person; however, in dyadic 
or group daily experience studies, additional covariation exists not only within each indi- 
vidual’s data points but also between individuals in the same social unit. In some cases 
this represents another level of nested data, and in others it represents a crossing of two 
levels. Moreover, sometimes a variable of interest can clearly distinguish one member of 
a dyad from the other (e.g., sex in the case of heterosexual spouses), and other times the 
members of the social unit are indistinguishable (e.g., same-sex friendship pairs). Each of 
these cases requires a different approach to data analysis. We refer the reader to several 
sources for analyzing dyadic data produced by diary studies: Atkins (2005), Barnett, 
Raudenbush, Brennan, Pleck, and Marshall (1995), Kashy, Donnellan, Burt, and McGue 
(2008), Kenny, Kashy, and Cook (2006), and Laurenceau and Bolger (2005). 


Close Relationships Research and Daily Experience Studies 


In the past decade, close relationships research employing daily experience 
methods has grown explosively. This growth is especially notable in contrast to the relatively 
small number of studies published when Gable and Reis (1999) wrote a primer on daily 
experience studies for a special issue on methodology for the journal Personal Relationships. 
In this section we highlight some of the research that has examined close relationship 
processes over time, context, and partner. However, readers should note that this is 
but a sampling of the many studies on close relationship processes in daily life that have 
contributed to the literature. 


Time 


Daily experience methods have allowed researchers to examine changes in relationship 
processes across time and how between-person factors might moderate these patterns. 
For example, in a 6-month study with biweekly reports of sexual desire in relatively new 
romantic relationships, Impett, Strachman, Finkel, and Gable (2008) found evidence for 
a linear pattern of decreased sexual desire over time. However, they also found that this 
was moderated by whether individuals had strong or weak approach relationship goals 
(goals focused on obtaining desired end states). Individuals with strong approach goals 
retained high levels of sexual desire over the course of the study, whereas those with 
weak approach relationship goals experienced declines in sexual desire over the course 
of the study. Another study examining the emotional patterns of nonmarital relation- 
ship breakup found that dissolution participants reported more anger and less love over 
the 28-day study than did dating participants (Sbarra & Emery, 2005). However, the 
researchers also found over the course of the study that dissolution participants’ reports 
of sadness declined significantly and that by the end of the study, dissolution participants’ 
reports of sadness did not differ from those of the dating participants. 

In another example, Fisher and Kim (2007) examined the usefulness of a foster care 
intervention program using daily experience reports on child behaviors collected over 
the course of a year. This study is unique, in that it is one of a few that combined an 
experimental manipulation (intervention) with daily experience methods. The research- 
ers found their intervention to be successful in increasing the presence of secure behaviors 
and decreasing the presence of insecure behaviors over time. Furthermore, they were able 
to document an interesting interaction, in which initial age of the child (between-person 
variable) affected the trajectory over time. Younger children showed larger increases in 
secure behavior over time in the control condition, whereas older children showed the 
greatest gains in secure behavior in the intervention condition. 


Partner 


Although not as common as studies of variation across time or context, daily experience 
methods also allow researchers to examine variations in relationship processes across 
different partners, as well as how between-person factors might moderate these patterns. 
For example, in a daily diary study spanning 1 week, 15- to 16-year-old adolescents with 
secure and insecure attachments to their mothers or fathers were found not to differ in 
the frequency of reported conflict with their parents (Ducharme, Doyle, & Markiewicz, 
2002). However, when examining conflict in peer interactions, researchers found that, 
compared to adolescents with secure attachment, adolescents with a dismissing attach- 
ment reported more conflict in peer interactions. This finding suggests that attachment 
security influences parent and peer interactions differently. 

A study by Harlow and Cantor (1995) nicely demonstrated how individual differ- 
ences moderated relationship processes across partners, specifically by directing the types 
of individuals with whom people spend time when things are going poorly. Daily diary 
data revealed that individuals who focus on social life outcomes (a concern with their 
social performance and appraisal of social tasks as difficult) tend to respond to poor 
social well-being by spending time with partners who previously have provided them 
with emotional support. In contrast, individuals who focus on improving their social lives 
(striving toward an ideal) respond to poor social well-being by spending time with part- 
ners who embody their self-ideals and facilitate self-improvement. Gable, Reis, Impett, 
and Asher (2004) asked a slightly different question and assessed who, if anyone, people 
turn to when positive things happen. They found that, surprisingly, people talked about 
everyday positive events at a high rate (about 75% of days) to close others (romantic part- 
ners, parents, close friends, and siblings). 


Context 


Finally, daily experience methods allow the examination of how variations of context 
affect relationship processes, as well as how between-person factors may moderate these 
associations. Contextual factors, by far, are the focus of the majority of daily experience 
studies on close relationship processes. The bulk of these studies have examined varia- 
tions in contextual factors that are inherent to the relationship processes under examina- 
tion, such as examinations of conflict days versus no-conflict days (we discuss contextual 
factors external to the relationship in the following section). For example, in a 2-week 
daily experience study of the correlates of the decision to disclose or conceal one’s sexual 
orientation, lesbians and gay men reported greater well-being on the days they disclosed 
their sexual orientation to others than on the days they concealed their sexual orienta- 
tion from others (Beals, Peplau, & Gable, 2009). A daily experience study examining the 
association between alcohol consumption and unprotected sex over 30 days found that 
alcohol consumption proximal to sex with steady partners did not increase the likelihood 
of unprotected sex; however, alcohol consumption proximal to sex with casual partners 
increased the probability of unprotected sex (Kiene, Barta, Tennen, & Armeli, 2009). 

Shrout, Herman, and Bolger (2006) found that bar exam takers who reported receiv-
ing practical support also reported decreased fatigue and increased vigor. Receiving emo-
tional support, however, was associated with increases in anger and anxious-depressed
mood the following day. In another example, Birnbaum, Reis, Mikulincer, Gillath, 
and Orpaz (2006) showed that attachment style can impact the outcomes of positive 
and negative sexual experiences. For instance, the authors found that for women with 
higher attachment anxiety, positive sex-related feelings were associated with even greater 
relationship-enhancing behaviors and relationship quality ratings on the day after they 
had sex, whereas following sexual experiences associated with negative feelings, the same 
highly anxious women showed even more negative responses the next day (e.g., lower 
relationship-enhancing behaviors, lower relationship quality). 


Promising Directions That Daily Experience Methods 
Can Take in Close Relationships Research 


The previous section represents only a small sample of recent close relationships research 
that has used daily experience methodologies. Although we believe all of our questions 
and theories would benefit from incorporating a within-person perspective, we feel that 
there are three major areas or questions in close relationships research that are ripe for
future attention. We outline these here and present examples of the few studies that hint 
at the magnitude of their promise. 

The first area of research that we believe holds major promise is work on daily rela- 
tionship processes as they show stability or change over long periods of time. Despite 
the fluctuations and transitions in relationships throughout their duration, studies that 
incorporate multiple daily experience samplings into long-term longitudinal studies are 
lacking. Studies such as these could address many questions regarding the association 
between day-to-day relationship processes and the patterns that unfold in the establish- 
ment, development, variability, and stability of relationships over time. For example, 
romantic relationships, which involve the interaction of two individuals’ beliefs, desires, 
and goals for themselves and their partners, could have different implications across vari- 
ous periods in the relationship (e.g., dating, marriage, parenthood, retirement). More- 
over, these beliefs, desires, and goals also change over time, which likely is at least in part 
due to the daily thoughts, feelings, and behaviors that constitute relationships. A study by 
Sprecher and Metts (1999), although not a daily experience study, illustrates the associa- 
tion between changing beliefs and relationship quality. In this longitudinal study of new 
dating partners, data were collected at five time points over 4 years. The authors found 
that individuals’ romantic beliefs about what relationships were supposed to be corre- 
lated with reports of their own relationship quality (love, satisfaction, and commitment) 
at Time 1; that is, positive beliefs about how relationships should be were associated with 
later increases in positive feelings and experiences in relationships. This finding, however, 
was not consistent over time; the authors found a general decline in romantic beliefs for 
couples who stayed together, unless their status changed from dating to engaged. These 
findings suggest that romantic beliefs and relationship quality may change together across 
time, with reciprocal paths of influence. 

Second, close relationships research needs to examine more carefully the environ- 
mental factors that lie largely outside of the relationship. As Berscheid (1999) eloquently 
argued, the context in which a relationship unfolds has strong effects on the course and 
quality of that relationship; however, as relationships researchers, we are prone to the 
fundamental attribution error, often ignoring this context and instead focusing on how 
factors within the person or the dyad influence the quality and course of relationship 
processes. Moreover, it is likely that factors within the person have a different effect con- 
tingent on the external context in which the relationship is situated. It is important, then, 
to examine the relationship context in order to understand behavior in relationships. The 
few studies that have focused on environmental factors have mostly examined stress. 
These investigations have yielded important findings. 

For example, Neff and Karney (2009), who conducted daily experience studies of 
couples, found that romantic partners’ reactivity to daily relationship experiences was 
stronger when they were experiencing greater than normal outside stress (e.g., work), 
and their specific relationship experiences were more closely integrated into their global 
satisfaction on days they experienced more external (e.g., work-related) stress. In another 
study, Story and Repetti (2006) found that wives reported greater anger and withdrawal 
on days they experienced a heavy workload, while husbands reported greater anger and 
withdrawal on days they experienced negative social interactions at work. In these stud- 
ies, the daily experience method was essential for examining how factors inside and 
outside the relationship interact, adding to our knowledge of both stress and coping, and 
of relationship processes. Clearly these studies show that external stressors play a large
role in relationship functioning, and depending on the relationship and environmental 
context, relationship processes may vary significantly. 

Last, variations in processes across different relationship partners are infrequently 
examined and are not well understood. Close relationships researchers generally focus 
on a particular relationship (e.g., marriage) or on a particular general phenomenon (e.g., 
rejection), neglecting the influence of variability across partners in relationships. Thus, 
we do not know, for example, whether the emotion patterns observed in spouses during 
conflict have any bearing on their patterns of emotion during conflict with their children. 
The variation across partners is likely very important, as illustrated by a now classic 
study by Miller and Kenny (1986). They showed that self-disclosure differs as a function 
of the unique relationship between two people, more so than how much individuals self- 
disclose or are disclosed to in general. In fact, people often adjusted their level of self- 
disclosure based on whom they disclose to. Similarly, in a study of autonomy in young 
adults’ relationship styles with mothers, fathers, best friends, and romantic partners, 
Neff and Harter (2003) found that most of the participants were not consistent in their 
relationship styles across these relationship partners. The self-focused autonomous style 
was often exhibited with parents, while the other-focused autonomous style was often 
exhibited with romantic partners. In short, the manner in which people interact with 
each other depends greatly on with whom they interact. Clearly, these studies showed 
that variability across partners and likely roles (e.g., parent, sibling, friend) is an impor- 
tant factor to consider in relationships theories and investigations. 


Concluding Comments 


Relationships are a vital part of psychological well-being and physical health (Reis et 
al., 2000). A central tenet in the field of personal relationships is that close relationships 
involve interdependence over some period of time, in various contexts, and that there is 
something special about a relationship that goes beyond the sum of each person’s charac- 
teristics. Thus, theories of close relationships and empirical investigations of relationship 
processes benefit from a within-person perspective that incorporates aspects of time, 
context, and partners. Moreover, the ubiquity of close relationships in daily life, and 
the difficulty of simulating many processes of interest in laboratory settings, mean that
studying these processes in the field as they unfold in daily life is an ideal approach. Daily
experience methods fit the bill perfectly. 


References 


Algoe, S., Gable, S. L., & Maisel, N. C. (2010). It’s the little things: Everyday gratitude as a booster shot 
for romantic relationships. Personal Relationships, 17, 217-233. 

Atkins, D. C. (2005). Using multilevel models to analyze couple and family treatment data: Basic and 
advanced issues. Journal of Family Psychology, 19(1), 98-110. 

Barkow, J., Cosmides, L., & Tooby, J. (Eds.). (1992). The adapted mind: Evolutionary psychology and 
the generation of culture. New York: Oxford University Press. 

Barnett, R. C., Raudenbush, S. W., Brennan, R. T., Pleck, J. H., & Marshall, N. L. (1995). Change in 
job and marital experiences and change in psychological distress: A longitudinal study of dual- 
earner couples. Journal of Personality and Social Psychology, 69, 839-850. 


Baumeister, R. F., & Leary, M. R. (1995). The need to belong: Desire for interpersonal attachment as a 
fundamental human motivation. Psychological Bulletin, 117(3), 497-539. 

Beals, K. P., Peplau, L. A., & Gable, S. L. (2009). Stigma management and well-being: The role of per- 
ceived social support, emotional processing, and suppression. Personality and Social Psychology 
Bulletin, 35(7), 867-879. 

Berscheid, E. (1999). The greening of relationship science. American Psychologist, 54, 260-266. 

Birnbaum, G. E., Reis, H. T., Mikulincer, M., Gillath, O., & Orpaz, A. (2006). When sex is more than 
just sex: Attachment orientations, sexual experience, and relationship quality. Journal of Person- 
ality and Social Psychology, 91(5), 929-943. 

Blieszner, R. (2006). A lifetime of caring: Dimensions and dynamics in late-life close relationships. 
Personal Relationships, 13, 1-18. 

Bowlby, J. (1969). Attachment and loss: Vol. 1. Attachment. New York: Basic Books. 

Cacioppo, J. T., & Patrick, B. (2008). Loneliness: Human nature and the need for social connection. 
New York: Norton. 

Cohen, S. (1988). Psychosocial models of the role of social support in the etiology of physical disease. 
Health Psychology, 7(3), 269-297. 

Coyne, J. C., Rohrbaugh, M. J., Shoham, V., Sonnega, J. S., Nicklas, J. M., & Cranford, J. A. (2001). 
Prognostic importance of marital quality for survival of congestive heart failure. American Jour- 
nal of Cardiology, 88, 526-529. 

Csikszentmihalyi, M., & Larson, R. (1983). The experience sampling method. New Directions for 
Methodology of Social and Behavioral Science, 15, 41-56. 

Deci, E. L., & Ryan, R. M. (1985). Intrinsic motivation and self-determination in human behavior. 
New York: Plenum Press. 

Diener, E., & Seligman, M. E. P. (2002). Very happy people. Psychological Science, 13(1), 81-84. 

Ducharme, J., Doyle, A. B., & Markiewicz, D. (2002). Attachment security with mother and father: 
Associations with adolescents’ report of interpersonal behavior with parents and peers. Journal of 
Social and Personal Relationships, 19(2), 203-231. 

Eng, P. M., Kawachi, I., Fitzmaurice, G., & Rimm, E. B. (2005). Effects of marital transitions on 
changes in dietary and other health behaviours in US male health professionals. Journal of
Epidemiology and Community Health, 59, 56-62.

Feeney, B. C., & Collins, N. L. (2001). Predictors of caregiving in adult intimate relationships: An 
attachment theoretical perspective. Journal of Personality and Social Psychology, 80, 972-994. 

Fincham, F. D., Paleari, F. G., & Regalia, C. (2002). Forgiveness in marriage: The role of relationship 
quality, attributions, and empathy. Personal Relationships, 9, 27-37. 

Fisher, P. A., & Kim, H. K. (2007). Intervention effects on foster preschoolers’ attachment-related 
behaviors from a randomized trial. Prevention Science, 8(2), 161-170. 

Gable, S. L., & Reis, H. T. (1999). Now and then, them and us, this and that: Studying relationships 
across time, partner, context, and person. Personal Relationships, 6, 415-432. 

Gable, S. L., Reis, H. T., & Downey, G. (2003). He said, she said: A quasi-signal detection analysis of 
spouses’ perceptions of everyday interactions. Psychological Science, 14, 100-105. 

Gable, S. L., Reis, H. T., Impett, E., & Asher, E. R. (2004). What do you do when things go right?: 
The intrapersonal and interpersonal benefits of sharing positive events. Journal of Personality and 
Social Psychology, 87, 228-245. 

Harlow, R. E., & Cantor, N. (1995). To whom do people turn when things go poorly?: Task orientation 
and functional social contacts. Journal of Personality and Social Psychology, 69(2), 329-340. 

House, J. S., Landis, K. R., & Umberson, D. (1988). Social relationships and health. Science, 241, 
540-545. 

Impett, E. A., Strachman, A., Finkel, E. J., & Gable, S. L. (2008). Maintaining sexual desire in intimate 
relationships: The importance of approach goals. Journal of Personality and Social Psychology, 
94(5), 808-823. 

Kashy, D. A., Donnellan, M. B., Burt, S. A., & McGue, M. (2008). Growth curve models for indistin- 
guishable dyads using multilevel modeling and structural equation modeling: The case of adoles- 
cent twins’ conflict with their mothers. Developmental Psychology, 44(2), 316-329. 

Kelley, H. H., & Thibaut, J. W. (1978). Interpersonal relations: A theory of interdependence. New 
York: Wiley. 

Kenny, D. A., Kashy, D. A., & Cook, W. L. (2006). Dyadic data analysis. New York: Guilford Press. 


Kiecolt-Glaser, J. K., Gouin, J., & Hantsoo, L. (2010). Close relationships, inflammation, and health. 
Neuroscience and Biobehavioral Reviews, 35, 33-38.

Kiecolt-Glaser, J. K., Loving, T. L., Stowell, J. R., Malarkey, W. B., Lemeshow, S., Dickinson, S. L., et 
al. (2005). Hostile marital interactions, proinflammatory cytokine production, and wound heal- 
ing. Archives of General Psychiatry, 62(12), 1377-1384. 

Kiene, S. M., Barta, W. D., Tennen, H., & Armeli, S. (2009). Alcohol, helping young adults to have 
unprotected sex with casual partners: Findings from a daily diary study of alcohol use and sexual 
behavior. Journal of Adolescent Health, 44(1), 73-80. 

Laurenceau, J.-P., & Bolger, N. (2005). Using diary methods to study marital and family processes. 
Journal of Family Psychology, 19, 86-97. 

Milardo, R. M., Johnson, M. P., & Huston, T. L. (1983). Developing close relationships: Changing pat- 
terns of interaction between pair members and social networks. Journal of Personality and Social 
Psychology, 44(5), 964-976. 

Miller, L. C., & Kenny, D. A. (1986). Reciprocity of self-disclosure at the individual and dyadic levels: 
A social relations analysis. Journal of Personality and Social Psychology, 50, 713-719. 

Neff, K. D., & Harter, S. (2003). Relationship styles of self-focused autonomy, other-focused con- 
nectedness, and mutuality across multiple relationship contexts. Journal of Social and Personal 
Relationships, 20(1), 81-99. 

Neff, L. A., & Karney, B. R. (2009). Stress and reactivity to daily relationship experiences: How stress 
hinders adaptive processes in marriage. Journal of Personality and Social Psychology, 97, 435- 
450. 

Onrust, S. A., & Cuijpers, P. (2006). Mood and anxiety disorders in widowhood: A systematic review. 
Aging and Mental Health, 10(4), 327-334. 

Reis, H. T., Collins, W., & Berscheid, E. (2000). The relationship context of human behavior and devel- 
opment. Psychological Bulletin, 126, 844-872. 

Reis, H. T., & Gable, S. L. (2000). Event sampling and other methods for studying daily experience. In 
H. T. Reis & C. M. Judd (Eds.), Handbook of research methods in social and personality psychol- 
ogy (pp. 190-222). New York: Cambridge University Press. 

Reis, H. T., & Wheeler, L. (1991). Studying social interaction with the Rochester Interaction Record. 
In L. Berkowitz (Ed.), Advances in experimental social psychology (pp. 269-318). New York: 
Academic Press. 

Rutter, M. (1979). Maternal deprivation, 1972-1978: New findings, new concepts, new approaches.
Child Development, 50(2), 283-305. 

Sbarra, D. A., & Emery, R. E. (2005). The emotional sequelae of nonmarital relationship dissolu- 
tion: Analysis of change and intraindividual variability over time. Personal Relationships, 12(2), 
213-232. 

Segrin, C., & Passalacqua, S. A. (2010). Functions of loneliness, social support, health behaviors, and 
stress in association with poor health. Health Communication, 25(4), 312-322. 

Shiffman, S., & Stone, A. A. (1998). Introduction to the special section: Ecological momentary assess- 
ment in health psychology. Health Psychology, 17, 3-5. 

Shrout, P. E., Herman, C. M., & Bolger, N. (2006). The costs and benefits of practical and emotional 
support on adjustment: A daily diary study of couples experiencing acute stress. Personal Rela- 
tionships, 13(1), 115-134. 

Sprecher, S., & Metts, S. (1999). Romantic beliefs: Their influence on relationships and patterns of 
change over time. Journal of Social and Personal Relationships, 16(6), 834-851. 

Story, L. B., & Repetti, R. (2006). Daily occupational stressors and marital behavior. Journal of Family 
Psychology, 20(4), 690-700. 

Wheeler, L., & Reis, H. T. (1993). Self-recording of everyday life events: Origins, types, and uses. Jour- 
nal of Personality, 59, 339-354. 


CHAPTER 29 


Personality Research 


WILLIAM FLEESON 
ERIK E. NOFTLE 


Our purpose in this chapter is to articulate the kinds of conceptual questions about
personality that experience sampling methodology (ESM) and other daily methods 
are uniquely suited to address. Because of its ability to assess what is actually happening 
in the moment, on multiple occasions, ESM opens up new questions about the trait- 
behavior relationship and the manifestation, the dynamics, and the inner workings of 
personality. In other words, how much and in what ways do people’s traits influence 
behavior? How do people’s personalities and behavior vary across occasions and time? 
How do people’s traits function, and according to which internal processes? 

There are at least three reasons we are pursuing this purpose. First, the kinds of con-
ceptual questions ESM can address are centrally important for personality psychology;
the ability of ESM to address them may catalyze new progress on
those important questions. Second, although ESM is an exciting and promising method, 
it is important to show that it is also conceptually relevant. We believe that ESM indeed 
sheds light on big conceptual questions, and in particular, on questions that are hard 
to illuminate otherwise. Third, it is important to articulate explicitly the links between 
methods and conceptual questions the methods can address. It is often easy to under- 
stand a method, or to understand a conceptual question, but it is often harder to draw 
the lines that connect a method with a question. It seems to us that ESM and other daily 
methods, while growing in use, are not growing quickly enough in personality research. 
Articulating the important conceptual questions addressable by ESM may help spread the 
use of ESM in personality research. 

In the context of this chapter, we assume that readers are familiar with ESM and 
other daily measures (see Conner & Lehman, Chapter 5, this volume). However, for com- 
pleteness, by ESM we denote a method that allows for repeated measurement of behavior 
states within the context of individuals’ everyday lives.¹ Several excellent papers have
been written about the advantages of ESM measures over alternative methods in general 
(in this volume, see Reis, Chapter 1; Schwarz, Chapter 2; Hamaker, Chapter 3). Although 
we mention them here where appropriate, our purpose in this chapter is not to reiterate 
those advantages. Although ESM certainly has contributed a great deal of understanding 
to the connection between relationship variables and behavior (Gable, Gosnell, & Prok,
Chapter 28, this volume), and affective traits and experienced affect (e.g., Augustine & 
Larsen, Chapter 27, this volume), we wish to locate this chapter more centrally in the 
conceptual domain of personality by articulating five specific conceptual questions that 
are relevant to personality and are addressable by ESM. 


How Are Broad, 
Latent Personality Characteristics Manifested? 


The first important conceptual question in personality psychology concerns the connec- 
tion between latent constructs of personality and their manifestation in behavior. Early 
definitions of personality argued that behavior deserves a central role in personality. For 
example, Allport’s (1961) definition of personality from Pattern and Growth in Personal- 
ity (the revised edition of his classic 1937 volume) postulated that personality determines 
an individual’s “characteristic behavior” (p. 28). In fact, traits have even been equated 
with summaries of behavior (Hampshire, 1953), such as in the act frequency and density 
distributions approaches (Buss & Craik, 1983; Fleeson, 2001). In contrast, a common 
criticism of traits is that they don’t appear to relate to actual behavior (Mischel, 1968; 
Ross & Nisbett, 1991). 

Given that the study of behavior is important for an understanding of personality, 
the question becomes how researchers can best study behavior to reveal information 
about a person that is as representative and comprehensive as possible. There have been 
three behavior measurement strategies in answering this question: studying behavior in 
time-limited laboratory or field studies, studying behavior as it is reflected in self-report 
(or informant-report) personality questionnaires, and studying behavior using ESM. 
Laboratory and field studies provide only a few still frames depicting people’s behavior 
from within the movies of their lives. These stills may contain crucial scenes or irrelevant 
details, and it is sometimes difficult to know which is which, without knowing more 
about the movie. A set of scores from a personality questionnaire, on the other hand, 
could be said to be more like a movie poster. It captures behavior that is centrally impor- 
tant and likely representative for the character(s) depicted because it has been carefully 
chosen to capture the essence of the movie. 

In addition to behavior measurement strategies of direct observation and personality 
questionnaires, ESM is also well suited to assessing how individuals behave. However, 
ESM boasts additional benefits in that it allows researchers to address coverage, repre- 
sentativeness, and ecological validity more fully than have previous strategies. ESM also 
permits study of behavior in a more complex way: Actual behavior is tracked across 
many contexts and a reasonably representative time span within the “natural habitat” 
of a person (De Vries, 1992; Furr, 2009). In so doing, ESM resembles something like a 
movie preview. The viewer certainly does not have the benefit of seeing the entire movie, 
but does get a more reasonable representation of the character(s) within different contexts 
and across time than would be obtained from a movie poster or a small set of stills from 
the movie. ESM also benefits from (1) immediate versus retrospective reporting, (2) spe- 
cific versus summary accounts of behavior, (3) recruitment of episodic versus semantic 
memory, (4) good potential for high reliability and power, and (5) inclusion of multiple 
contexts, all of which are relevant to basic questions in personality psychology. When it
comes to personality, what the person actually does, thinks, and feels is at least as impor- 
tant to know as what he or she remembers doing, thinking, and feeling on average (or 
what self-concept he or she is able to construct from these memories). 

One of the main advantages of ESM is its promise to measure (one form of) actual 
behavior (Furr, 2009). In 2007, Baumeister, Vohs, and Funder argued that social and per- 
sonality psychology have neglected the study of behavior, even though the fields assume 
that the study of important social behaviors is a central, paradigmatic goal. Furr (2009), 
in a target article on the importance of behavior within personality psychology, conducted 
a content analysis of methods from samples of two major personality journals over the 
last decade. Furr found that less than 2% of these published personality articles used 
self-reports in ESM to study behavior, which, he argued, is one of the three best methods 
for studying actual behavior, along with direct observation and acquaintance reports. 
For example, Wu and Clark (2003) asked participants to indicate which of 55 concrete 
behaviors (e.g., “Was purposely rude, insulting to someone”) they had performed on a 
given day for 14 consecutive days. 

Manifestations of personality variables are not limited just to concrete, physical 
actions, but include actions, cognitions, motivations, and emotions (forming the acro- 
nym ACME; Fleeson & Noftle, 2008; Pytlik Zillig, Hemenover, & Dienstbier, 2002). For 
each of these elements, ESM allows researchers to address questions in personality stud- 
ies about what actually happens. The personality state concept, having been designed to 
assess all four of these psychological modes of manifestation at once (Cattell, Cattell, & 
Rhymer, 1947; Fleeson, 2001), goes further, so that the entire content of the trait can be 
assessed in the moment. Personality states describe how much a person is manifesting a 
given trait at any moment, in any or all of its modes, and enable testing of the actual
appearance of personality variables in life. 

ESM studies have already provided some examples of the relationship of latent per- 
sonality constructs to their manifestations. Fleeson and Gallagher (2009) showed that the 
actual manifestation of traits in behavior is highly correlated with trait scores, meaning 
that even in the face of real pressures and real demands of situations, people tend to act at 
their trait level over time. Using the clever EAR technology, Mehl (see Mehl & Robbins, 
Chapter 10, this volume) revealed that traits are manifest in environmental sounds, such 
as language use patterns, types of conversations, and emotion expressions (e.g., laugh- 
ing and crying). Glasser, van Os, Mengelers, and Myin-Germeys (2008), Ebner-Priemer 
and Trull (Chapter 23, this volume), and Zeigler-Hill and Abraham (2006) showed that 
borderline personality disorder is manifest in affective and self-esteem lability. Kwapil et 
al. (2009) showed that schizophrenia spectrum disorders are manifest in social interac- 
tions. Stone and colleagues (1998) showed that coping styles are manifest in response to 
the events of daily life. Wood, Quinn, and Kashy (2002) showed that habits are manifest 
as nearly automatic behaviors, accompanied by unrelated thoughts and lowered stress. 
Torquati and Raffaelli (2004) showed that attachment styles are manifest in time spent 
and emotions experienced with others. Skitka, Bauman, and Sargis (2005) showed that 
moral conviction is manifest in social distance from attitudinally dissimilar others, and 
Bardi and Schwartz (2003) showed that values are manifest in corresponding behaviors. 
Similar findings can be revealed about manifestation of other latent personality con- 
structs. 


How Do Variability and Stability Coexist
in Personality Manifestation? 


Personality questionnaires allow the researcher to compare people in terms of their aver- 
age ways of being. Questionnaires provide scores that compare persons, just as tempera- 
ture averages compare climates of different cities. However, personal behaviors are not 
only averages; people also respond adaptively to situations, actively pursue goals, and 
follow temporal trends. This aspect of persons takes into account variations, which also 
is crucial information for making comparisons. For example, Saint Louis and San Fran- 
cisco have remarkably similar annual mean temperatures (57° vs. 56° F, respectively). 
However, Saint Louis is about 20 degrees colder than San Francisco in January, and 
about 20 degrees warmer than San Francisco in July (whether comparing daily highs or 
lows). In fact, it turns out that Saint Louis is simply much more variable in temperature 
than San Francisco. Similarly, since both views of persons—averages and variabilities— 
provide interesting information, it is important that personality psychologists be able to 
study both. It is desirable to be able to assess the stable and the variable aspects of per- 
sonality, to compare the two aspects, and to relate them to each other. 
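
The contrast between averages and variabilities can be made concrete with a minimal sketch. The ratings below are invented for illustration (a hypothetical 1-7 extraversion-state scale, one value per ESM report), and the names steady_person, variable_person, and summarize are our own assumptions rather than part of any study cited here. The sketch shows only that two people can share the same mean state level while differing sharply in within-person spread.

```python
from statistics import mean, pstdev

# Hypothetical extraversion-state ratings (1-7 scale), one per ESM report.
# These values are invented for illustration, not taken from any study.
steady_person = [4, 4, 5, 4, 3, 4, 4, 4]
variable_person = [1, 7, 2, 6, 4, 7, 1, 4]


def summarize(ratings):
    """Return the average state level and the within-person spread."""
    return {"mean": mean(ratings), "sd": pstdev(ratings)}


steady = summarize(steady_person)
variable = summarize(variable_person)

# Both people have the same mean, but very different standard deviations.
print(steady)
print(variable)
```

A questionnaire-style average would treat these two hypothetical people as equivalent; only the repeated reports reveal that one of them moves across most of the scale from occasion to occasion.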

But how can personality be both stable and variable? Lay perceptions of behavior 
seem to suggest two paradoxical accounts of personality. For example, people expect 
their friends, colleagues, and romantic partners to behave in generally stable ways, and 
they also sometimes expect this of themselves. If a typically quiet and shy friend attends a 
dinner party and acts boisterously and garrulously, he is seen as acting “out of character,” 
evoking surprise. However, when individuals introspectively examine their own behav- 
ior, they are sometimes able to recall instances in which they were quiet and instances 
in which they were talkative, which suggests that behavior is at least somewhat variable. 
We believe that both these perceptions are accurate. How can such opposing, apparently 
paradoxical views of personality be accurate? How can people’s recollections of their 
own behavior suggest a large amount of variability, whereas their views of people in gen- 
eral suggest so much stability? 

One answer to the paradox might be that both perceptions of stability and variability 
capture the reality of a person. Additionally, different aspects of a person’s behavior may 
be differentially salient depending on whether the observer is taking an outside, global 
perspective or an inside, detailed perspective. When asked to think about what people are 
like in general, observers might take an outside perspective, and so might focus on aver- 
age behaviors, but when taking an inside perspective, observers pay attention to specific 
circumstances, and so might focus on variability in behavior. 

ESM is well suited to a resolution of the paradox because it allows researchers to 
study both aspects of personality. Multiple ratings of personality states obtained through 
ESM allow for the study of both stability and variability: Averaging across occasions allows an easy
aggregate to be formed, and, in contrast, maintaining the varied occasions allows vari- 
ability to be studied. For example, in the density distributions approach (Fleeson, 2001), 
multiple assessments of states are organized for each individual into a distribution of 
frequencies in manifestations of each level of each Big Five state. The idea of the density 
distributions approach is to expand the definition of traits to include not just a person’s 
average way of behaving but also a person’s range of behavior, including the amount of 
variability. In other words, instead of a trait being captured by a single number represent- 
ing typical behavior, a trait is captured by several numbers, including a mean (one’s aver- 
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age behavior), a standard deviation (one’s amount of variability), and amounts of skew 
and kurtosis (one’s shape of behavior distributions). Fleeson demonstrated large amounts 
of stability in behavior: Individuals are remarkably similar from one week to another in 
their average levels of Big Five states, with correlations around .90. At the same time, 
Fleeson demonstrated that intraindividual variability of trait-relevant behavior is quite 
high: People who are highly extraverted often act introverted, and people who are 
highly introverted often act extraverted (Baird, Le, & Lucas, 2006; Fleeson, 2001). This 
was evident visually in the wide span of the individuals’ distributions. 
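
As a rough illustration of the density distributions idea, the sketch below summarizes each person's momentary state ratings by the four quantities named above (mean, standard deviation, skew, and kurtosis). The ratings, the person labels, and the function name `density_summary` are invented for illustration and are not taken from Fleeson (2001); the moments use simple population (divide-by-n) formulas, which is an assumption rather than the original analysis.

```python
from math import sqrt


def density_summary(states):
    """Summarize one person's momentary state ratings as a distribution."""
    n = len(states)
    mean = sum(states) / n
    # Central moments, population form (dividing by n).
    m2 = sum((x - mean) ** 2 for x in states) / n
    m3 = sum((x - mean) ** 3 for x in states) / n
    m4 = sum((x - mean) ** 4 for x in states) / n
    sd = sqrt(m2)
    return {
        "mean": mean,                      # average behavior
        "sd": sd,                          # amount of variability
        "skew": m3 / sd ** 3,              # asymmetry of the distribution
        "kurtosis": m4 / m2 ** 2 - 3.0,    # excess kurtosis
    }


# Invented extraversion state ratings (1-7 scale) for two hypothetical people.
ratings = {
    "person_a": [2, 3, 5, 6, 4, 3, 5, 4],
    "person_b": [4, 4, 5, 4, 3, 4, 5, 4],
}

summaries = {pid: density_summary(xs) for pid, xs in ratings.items()}
for pid, summary in summaries.items():
    print(pid, summary)
```

In these made-up data the two people have nearly the same mean but clearly different standard deviations, which is the kind of difference a single trait score would hide.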

High levels of stability and variability are typical not only in undergraduate sam- 
ples for Big Five personality states. High stability and variability have been found in the 
self-concepts and Big Five states of middle-aged and older adults (Charles & Pasupathi, 
2003; Noftle & Fleeson, 2010), in the traits of the interpersonal circumplex (Fournier, 
Moskowitz, & Zuroff, 2008), in self-esteem (Heppner et al., 2008; Moneta, Schneider, & 
Csikszentmihalyi, 2001), and in affiliation motivation (O’Connor & Rosenblood, 1996). 
Thus, the coexistence of the stable, personality-driven behavior pattern and the variable, 
responsive behavior pattern may be the rule. 


Assessing Several Types of Consistency in Manifestation 


A third conceptual question of importance in personality builds on the preceding ideas. 
This question is what is meant by consistency, and what the degree of different 
conceptions or types of consistency implies about the nature of traits. We have previously defined 
consistency as the degree to which one person’s manifestations of a given personality 
characteristic are similar to his or her other manifestations (Fleeson & Noftle, 2008). 
Consistency in behavior is critical to the existence of personality. Any amount of any 
type of consistency is both sufficient and necessary for the existence of personality (All- 
port, 1937). Consistency in behavior is sufficient for the existence of personality, because 
consistency shows that something within the person is responsible for the individual dif- 
ferences in the behavior; that is, if individuals differ in some behavior, and those differ- 
ences repeat themselves on another occasion, then something enduring in those individu- 
als must have been responsible for their behavior. That enduring something, whatever it is, 
is part of personality. Consistency in behavior is necessary for the existence of personality, 
because a lack of consistency shows that transitory factors, rather than enduring factors, 
are responsible for the individual differences in the behavior; that is, if individuals differ in 
some behavior, but those differences are randomly rearranged on another occasion, then 
there is nothing in those individuals that led to the differences in their behavior. 
However, one of the early discoveries in the person-situation debate was the exis- 
tence of multiple forms of consistency. The situation side of the debate was emphasizing 
cross-situational (in)consistency (Mischel & Shoda, 1998), whereas the person side of the 
debate was emphasizing a generalized form of consistency, without regard for the situ- 
ations in which the behaviors occurred (Epstein, 1979). In particular, the situation side 
argued that behavioral consistency was low because the correlation across individuals 
of behavior in one situation with their behavior in a different situation was low (cross- 
situational consistency). The person side argued that behavioral consistency was in fact 
high, because the correlation between aggregated behaviors across many situations dur- 
ing one time span and aggregated behaviors across many situations from a different time 
span was high. Other theorists discovered other types of consistency, resulting in about 
seven types of theorized consistency (Block, 1971; Caspi, 1987; Epstein, 1979; Ozer, 
1986; Mischel & Shoda, 1998; Sroufe, 1979). 

Thus, not only the amount but also the type of consistency is important (Fleeson & 
Noftle, 2008). Different types of consistency imply different conceptions of personality, 
and different research strategies are needed to explain personality. For example, behavior 
that is temporally stable but not cross-situationally consistent implies that personality 
should be conceived of as if-then profiles or signatures (Mischel & Shoda, 1998). Behav- 
ior that is temporally stable but not consistent across different types of behaviors implies 
that personality consists of hundreds of narrow habits but no broad traits such as the Big 
Five. 

By analyzing the components underlying the basic question of consistency, and then 
organizing the components into a framework, we were able to identify several more types 
of consistency, creating a supermatrix of 36 types of consistency (Fleeson & Noftle, 
2008). All types of consistency address the following formulation of the basic question of 
consistency: Does the enactment of behavior by the same person remain similar across 
variations in competing determinants of behavior? However, by defining enactment in any 
of four ways, similarity in any of three ways, and the competing 
determinant in any of three ways, this question can be asked in 4 × 3 × 3 = 36 different 
ways. 
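
The counting behind the supermatrix can be sketched by crossing the three sets of definitions. The labels below are placeholders only: the text gives the number of options on each dimension (four, three, and three) but not their names, which are set out in Fleeson and Noftle (2008).

```python
from itertools import product

# Placeholder labels; only the counts (4, 3, 3) come from the text.
enactment_definitions = ["enactment_1", "enactment_2", "enactment_3", "enactment_4"]
similarity_definitions = ["similarity_1", "similarity_2", "similarity_3"]
competing_determinants = ["determinant_1", "determinant_2", "determinant_3"]

# Each type of consistency is one combination of the three choices.
consistency_types = list(
    product(enactment_definitions, similarity_definitions, competing_determinants)
)

print(len(consistency_types))
```

Each tuple in `consistency_types` stands for one way of asking the basic question of consistency.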

ESM is well suited for investigating each of these types of consistency and assessing 
the degree to which each type is present in personality. ESM’s primary feature—multiple 
assessments per person—allows assessment of enough occasions per person to determine 
whether states are consistent across occasions. ESM is specifically suited to the question 
of multiple types of consistency because it allows the researcher to assess different defini- 
tions of behavior, to compute multiple forms of similarity, and to simultaneously assess 
several competing determinants of behavior. Given that consistency is so important to 
the existence and nature of personality, employing ESM to address each of the types of 
consistency is paramount. 

For example, when enactment of behavior is defined as a single action, consistency 
is moderate, but when enactment is defined as averages of behavior across a few days, 
consistency is very high (Epstein, 1979; Fleeson, 2001). Consistency of behavior when 
enactment is defined as averages was shown to be high across competing determinants of 
both situation and time (Moskowitz, 1994), even across cultures (Church et al., 2008). 
Behavior enactment defined more complexly as the person’s amount of variability in 
behavior over time was also shown to be moderately consistent (Fleeson, 2001). A still 
more complex definition of behavior enactment is as a contingency, in which behavior is 
manifested under certain conditions but not others. Individual differences in such contin- 
gencies were found and demonstrated to be consistent across time both for the behaviors 
of the interpersonal circumplex (Fournier et al., 2008) and for the behaviors of the Big 
Five (Fleeson, 2007). Each of these types of consistency reveals information about the 
nature of the underlying personality variables that produced the behaviors. 


Assessing Within-Person Structures of Personality 


Another important conceptual question in personality psychology is how the many and 
diverse elements of personality within each individual are related to each other. Allport 
(1937) often argued that personality is the arrangement of and interaction of personality 
variables, and that such arrangements and interactions might be universal or differ across 
people, but that in any case they need to be the focus of personality research. 

This is an important conceptual question in personality research for several reasons. 
First, structure to date has been studied mainly by investigating between-person differ- 
ences in levels of personality components. Although important, such research does not 
explain the psychological functioning of individuals, which can be accomplished only by 
investigating the structure of components within a personality (Conner, Tennen, Fleeson, 
& Feldman Barrett, 2009). Second, this is an interesting question because it leads to 
a major test of the applicability of the five-factor structure of personality (Saucier & 
Goldberg, 1998) or the HEXACO model (Honesty-Humility, Emotionality, Extraver- 
sion, Agreeableness, Conscientiousness, and Openness to Experience; Ashton & Lee, 
2007). If the traits of these models do not describe the ways traits are structured within 
people, but only the ways individual differences are structured in the population, then the 
applicability of such models as descriptions of psychological functioning of individuals 
is severely limited. 

A third reason this is an important question is that the structural relationships may 
be different for different individuals. For example, extraversion may be a universal dimen- 
sion that applies to all individuals, and on which individuals differ only in their level. In 
contrast, individuals might differ in how much the construct of extraversion even applies 
to them as individuals. Sometimes known as “traitedness” (Reise & Waller, 1993), the 
concept of individuals differing in the degree to which a trait applies to them tends to 
be based on the idea that the components of a given trait may not combine in the same 
way for different individuals. This is one of the more central concepts within idiographic 
research because, according to this notion, different individuals cannot even be properly 
or fully studied with the same variables. 

ESM is well suited to address this issue, because its assessment of personality mani- 
festations on repeated occasions suggests a way to ascertain the relationship between 
components of personality within an individual. ESM suggests that a dimension applies 
to an individual to the degree that manifestations of the subcomponents of the dimension 
are correlated within that person across occasions (e.g., talkativeness and assertiveness 
are two subcomponents of extraversion). If manifestations of two subcomponents of a 
dimension are correlated within a person across occasions, there are only a few possible 
explanations: The two subcomponents contain overlapping meaning for that individual; 
one of the subcomponents causes the other for that individual, so that they co-occur in 
behavior; or a third variable within that person causes both subcomponents. For exam- 
ple, a relationship between talkativeness and assertiveness exists for an individual if he 
or she is talkative more often when he or she is assertive, and can be explained only if the 
two have overlapping meaning, or if talkativeness causes assertiveness or assertiveness 
causes talkativeness for that individual, or if a third variable causes the individual to be 
both talkative and assertive whenever the third variable is active, for example, if they are 
activated in service of the same goal. So a dimension applies to an individual to the extent 
that its subcomponents are related to each other for a given individual in the same way 
that they are related to each other for people in general. In future research, specific atten- 
tion should be given to subcomponents of traits such as the facets of the five-factor model 
(McCrae & Costa, 2008) or the Big Five aspects (DeYoung, Quilty, & Peterson, 2007), 
in order to determine whether components of broad traits are differentially structured 
within individuals. 
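
The within-person correlation described above can be sketched as follows: for each person, correlate two subcomponents across that person's own occasions. The data, person labels, and helper name `pearson` are invented for illustration; a real ESM study would have many more occasions per person and would need to handle missing reports.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented momentary ratings: one (talkativeness, assertiveness) pair per occasion.
reports = {
    "person_a": [(2, 2), (4, 5), (6, 6), (3, 3), (5, 5)],
    "person_b": [(2, 5), (4, 3), (6, 4), (3, 4), (5, 4)],
}

# Correlate the two subcomponents across occasions, separately for each person.
within_r = {
    pid: pearson([t for t, _ in occasions], [a for _, a in occasions])
    for pid, occasions in reports.items()
}

for pid, r in within_r.items():
    print(pid, round(r, 2))
```

On the reading given in the text, the dimension would apply more clearly to the first hypothetical person, whose two subcomponents rise and fall together, than to the second.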

As discussed by Brose and Ram (Chapter 25, this volume), p-factor analysis can 
identify entire structures that are specific to individuals. These structures can be com- 
pared across individuals to see whether there are general or common dimensions (All- 
port, 1937), and the degree to which trait structures differ across individuals (Cattell et 
al., 1947; Cramer, Waldorp, van der Maas, & Borsboom, 2010; Hooker, Nesselroade, 
Nesselroade, & Lerner, 1987). Borkenau and Ostendorf (1998) published the only exam- 
ple of a P-technique factor analysis of the Big Five personality trait structure of which 
we are aware. They found that the covariations across participants were well described 
by the Big Five, but at the same time, many individuals revealed different structures. The 
researchers were not able to tell whether these differences were due to measurement error 
or to genuine psychological differences between people in their dimensions. However, 
Beckmann, Wood, and Minbashian (2010) examined the within-person relationship 
between two of the five traits and found that although Conscientiousness and Neuroti- 
cism were negatively correlated across individuals, they were positively correlated within 
individuals. Perhaps Conscientiousness and Neuroticism have a functional relationship, 
which is revealed when within-person structures are assessed. 
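
A small numerical sketch can show how a between-person correlation and within-person correlations can have opposite signs. The numbers below are invented to reproduce only the sign pattern described above; they are not the data or results of Beckmann, Wood, and Minbashian (2010), and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def mean(values):
    return sum(values) / len(values)


# Invented (conscientiousness state, neuroticism state) pairs per occasion.
reports = {
    "person_a": [(1, 5), (2, 7), (3, 6)],
    "person_b": [(3, 3), (4, 5), (5, 4)],
    "person_c": [(5, 1), (6, 2), (7, 3)],
}

# Within-person: correlate the two states across each person's own occasions.
within_r = {
    pid: pearson([c for c, _ in occ], [n for _, n in occ])
    for pid, occ in reports.items()
}

# Between-person: correlate the person-level averages across individuals.
mean_c = [mean([c for c, _ in occ]) for occ in reports.values()]
mean_n = [mean([n for _, n in occ]) for occ in reports.values()]
between_r = pearson(mean_c, mean_n)

print("within:", within_r)
print("between:", between_r)
```

Here people who are more conscientious on average are less neurotic on average, yet within every person the two states rise together across occasions.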


Investigating Processes Underlying Personality Variables 


A fifth important conceptual question is what mechanisms underlie personality vari- 
ables, especially the traits of the Big Five. That is, what causes people to have their given 
standing on a trait; what causes people with given standings to enact relevant behaviors, 
feelings, and thoughts; and what sustains people’s standings over time? The mechanism 
would describe the step-by-step process that ends in the manifestation of a trait in actual- 
ity. Thus, identifying process is about predicting manifestation. 

This is an important question for personality psychology. By identifying processes, 
personality psychology can move from a descriptive science to an explanatory science. 
The trait perspective has clarified the descriptive side of traits by organizing ways to 
describe people’s personalities (McCrae & Costa, 2008). Nonetheless, although the field 
has witnessed a plethora of reasonable definitions of traits (Allport, 1937; Matthews, 
Deary, & Whiteman, 2003), no scientifically accepted theory of traits goes much beyond 
the definitions (Fleeson, 2012; Matthews et al., 2003). As a result, very little is known 
about how traits produce behavior, why persons differ in their traits, what traits describe 
about behavior, or how traits work. The Big Five personality traits are sorely in need of 
explanation. 

ESM is well suited to identification of processes underlying personality variables. 
The researcher can assess the degree of manifestation of the trait or another personality 
variable in the moment. Manifestation varies across occasions, and the researcher can try 
to predict that variability. The researcher hypothesizes a cause or causes of manifesta- 
tion, assesses the presence of that cause in each of the moments, and then tests the degree 
of covariation between the hypothesized cause and the manifestation. Significant cova- 
riation indicates identification of a candidate cause of manifestation of that trait. Focused 
follow-up experiments can then ascertain the causal status of the hypothesized cause. 
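
The covariation test in this procedure can be sketched, under simplifying assumptions, as a pooled within-person slope: each person's scores are centered on that person's own mean, so stable differences between people drop out and only occasion-to-occasion covariation remains. The variable names, the 0/1 coding of the hypothesized cause, and the data are invented; a full analysis would more likely use a multilevel model and would also report a significance test, which this sketch omits.

```python
def person_center(values):
    """Subtract a person's own mean from each of his or her scores."""
    m = sum(values) / len(values)
    return [v - m for v in values]


# Invented data: per occasion, (others_present coded 0/1, extraversion state 1-7).
reports = {
    "person_a": [(0, 2), (1, 4), (0, 3), (1, 5)],
    "person_b": [(0, 5), (1, 6), (1, 7), (0, 4)],
}

centered_cause, centered_state = [], []
for occasions in reports.values():
    centered_cause += person_center([cause for cause, _ in occasions])
    centered_state += person_center([state for _, state in occasions])

# Pooled within-person regression slope of state on hypothesized cause.
within_slope = (
    sum(x * y for x, y in zip(centered_cause, centered_state))
    / sum(x * x for x in centered_cause)
)

print(within_slope)
```

Although the second hypothetical person is more extraverted overall, that difference does not enter the slope, because each person serves as his or her own baseline.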

We suggest that this procedure identifies the processes underlying traits because the 
processes underlying traits are what determine manifestation. Processes describe the tem- 
poral covariations in the inputs and outputs that explain the psychological functioning 
of the individual (e.g., the temporal covariation between the presence of other people and 
acting extraverted). Processes can be described with simple linear regression coefficients, 
or with complex functions such as curvilinear functions, sine waves, spectral analysis, 
factor analyses, and so on (Larsen, 1989): Any technique that can be conducted across 
individuals can also be conducted within individuals on these data. 

Within-person analyses are not strictly necessary for identifying processes— 
between-person analyses can stand in as proxies (Fleeson, 2007; Hamaker, Chapter 3, 
this volume). However, they often are not as effective, and they often provide different 
answers. Between-person analyses correlate the average level of manifestation with the 
average level of a hypothesized cause. They are less effective because the average levels 
include the sediment of many different processes that have operated over a long period 
of time, whereas the momentary manifestations represent a smaller number of processes 
activated at a time, and with wider variation, allowing more fidelity in aiming directly at 
what the researcher wants to assess. Furthermore, within-person methods bypass purely 
fixed individual differences such as social desirability or fixed features of the individual, 
because each individual acts as his or her own control. 

For example, Fleeson (2007) showed that current manifestation of Big Five traits is 
correlated with current situation properties: People act more extraverted when others are 
friendlier and the situation is less anonymous, and more conscientious when the situation 
is task-oriented. Fleeson demonstrated that the mechanisms underlying the Big Five are at 
least partly situation-based. Fournier and colleagues (2008) conducted a similar analysis, 
but they used the traits of the interpersonal circumplex and if-then profiles. Bleidorn 
(2009) and Heller, Komar, and Lee (2007) used a similar strategy to demonstrate that 
the mechanisms underlying the Big Five were at least partly motivation-based. At a more 
specific trait level, Lucas, Le, and Dyrenforth (2008) used ESM to provide evidence that 
sociability is not the core process underlying extraversion. Minbashian, Wood, and 
Beckmann (2010) used ESM to provide evidence for the task-based contingencies underlying 
Conscientiousness, and Conway, Rogelberg, and Pitts (2009) used ESM to provide evi- 
dence for the affect-based mechanisms underlying Agreeableness behavior. 

A particularly exciting potential outcome is that this line of research could link the 
descriptive Big Five (and HEXACO; Ashton & Lee, 2007) models with the explana- 
tory social-cognitive perspective on personality, which has made important theoretical 
advances, proposing potential explanatory mechanisms underlying individual differences 
in behavior. In 1937, Allport described traits as made up of cognitive, motivational, and 
affective processes that produce trait-manifesting behavior. Mischel and Shoda (1998) 
provided a specifically social-cognitive version of this explanatory account. As useful and 
compelling as Allport’s theory and Mischel and Shoda’s cognitive-affective personality 
system (CAPS) are as explanatory accounts, they lack an organizational framework to 
determine which important traits should be included in a model. The theories can explain 
why people differ, but the theories have not yet explicitly identified how people differ, or 
which individual differences the theories should be used to explain (with some notable 
exceptions; e.g., rejection sensitivity: Romero-Canyas, Anderson, Reddy, & Downey, 
2009; narcissism: Morf & Rhodewalt, 2001). 

Whole trait theory (Fleeson, 2012) proposes that joining the two perspectives 
together has great advantages: The strength of each perspective precisely addresses the 
weakness of the other. ESM would allow joining together the two perspectives. When 
processes are discovered that explain manifestation of Big Five traits, explanatory accounts 
of the sort proposed by social-cognitive mechanisms can be identified. For example, when 
processes relating situation properties to Big Five trait manifestations are discovered, an 
explanatory social-cognitive mechanism is identified. What is important here is that the 
outcome variable, the personality manifestation, represents the descriptive side of traits 
because the manifestation is expressed in the same terms as the trait; that is, it is precisely 
the Big Five traits that are being explained by the social-cognitive mechanisms. Thus, the 
explanatory and descriptive sides would be directly and straightforwardly linked. Heller 
and colleagues (2007) have presented another example of a theoretical model integrating 
explanatory models with descriptive trait models. Their model proposes that social roles 
elicit different types of goals, which in turn elicit personality states. 

Finally, individuals may even differ in the processes themselves. The notion that indi- 
viduals may differ even in the processes that describe them, a very exciting development 
in personality psychology, raises personality to another level of abstraction, complex- 
ity, and sophistication. When personality psychologists study processes, they typically 
assume that the processes are common to all individuals (Pervin, 2003) and explain indi- 
vidual differences in the levels on the dimensions. But it is possible that people differ from 
each other even in the processes themselves (Allport, 1937; Fleeson, 2007). Because ESM 
collects large amounts of data for each person, it is also possible to calculate the degree to 
which each individual deviates from the average process and, in fact, to test whether the 
deviations across individuals are greater than would be expected by chance. 
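
One simple way to picture individual differences in a process is to estimate the same slope separately for each person and compare each slope with the average. The sketch below does only that descriptive step, using invented data and hypothetical names. It does not carry out the chance test mentioned above, which in practice would typically rely on the random-slope variance of a multilevel model or a comparable procedure.

```python
def ols_slope(xs, ys):
    """Ordinary least squares slope of ys on xs for one person."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Invented data: per occasion, (situation friendliness, extraversion state).
reports = {
    "person_a": [(1, 2), (2, 3), (3, 4), (4, 5)],
    "person_b": [(1, 3), (2, 3), (3, 4), (4, 4)],
    "person_c": [(1, 5), (2, 4), (3, 4), (4, 3)],
}

# One slope per person: that person's version of the process.
slopes = {
    pid: ols_slope([x for x, _ in occ], [y for _, y in occ])
    for pid, occ in reports.items()
}

# The average process, and each person's deviation from it.
average_slope = sum(slopes.values()) / len(slopes)
deviations = {pid: slope - average_slope for pid, slope in slopes.items()}

print("slopes:", slopes)
print("average:", average_slope)
print("deviations:", deviations)
```

In these made-up data the third person shows the reverse of the average pattern, which illustrates what differing in the process itself would look like.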


Conclusion 


We have identified five types of important conceptual questions in personality. These con- 
ceptual questions either necessitate ESM or are greatly facilitated through the application 
of the method. These questions are certainly not the only ones for which ESM is useful; 
indeed, there is an abundance of potential and important questions. ESM can be used 
to provide new answers to questions about development across the lifespan (Hooker & 
McAdams, 2003; Noftle & Fleeson, 2010). It can be used to validate individual-differences 
constructs, such as implicit self-esteem (Conner & Feldman Barrett, 2005), knowledge 
structures (Turan & Horowitz, 2010), desired affect (Augustine, Hemenover, Larsen, & 
Shulman, 2010), or inspiration (Thrash & Elliot, 2003). ESM can be used to assess the 
role of personality in important behaviors, such as substance abuse (Chakroun, Johnson, 
& Swendsen, 2010), flow (Wong & Csikszentmihalyi, 1991), school performance (Wong 
& Csikszentmihalyi, 1991), prosocial behavior (Ilies, Scott, & Judge, 2006), and cardio- 
vascular functioning (Kamarck et al., 2005; Ode, Hilmert, Zielke, & Robinson, 2010). 
ESM can be used to assess moderator effects of personality on processes such as narrative 
development (McLean & Pasupathi, 2006), alcohol use (Armeli, Todd, & Mohr, 2005), 
emotion suppression (Chapman, Rosenthal, & Leung, 2009), self-focused attention 
(Field, Joudy, & Hart, 2010), and religiosity (Steger & Frazier, 2005). ESM is important in 
addressing idiographic questions (Conner et al., 2009; Hamaker, Chapter 3, this volume). 
Finally, ESM is useful for identifying the processes underlying motivation (Brandstatter & 
Eliasz, 2001), such as conflict and differentiation among goals (Emmons & King, 1988, 
1989), the behaviors associated with motives and goals (Bornstein, 1998; Koestner & 
Losier, 1996; McAdams & Constantian, 1983), motivational strategies (Norem & Illing- 
worth, 1993), fine-tuned reactions to unfolding events (Conti, 2001; Schantz & Conroy, 
2009), and identification of central needs (Sheldon, Elliot, Kim, & Kasser, 2001). 

Although ESM is not used by personality researchers as frequently as it should be, 
its use so far has led to significant advances in the field, and ESM is growing rapidly, as 
evidenced by the dates on the publications referenced in this chapter—almost all are since 
2000, and many are clustered in just the past 2-3 years. We hope this chapter has convinc- 
ingly laid out the case that ESM holds one of the keys for unlocking fundamental concep- 
tual questions in personality. To allow for rapid advances in understanding traits’ func- 
tions, functioning, and manifestation in behavior—normatively and idiographically—we 
urge more personality researchers to consider ESM as both a conceptual framework for 
theory development and a highly useful tool for measurement. 


Note 


1. Although our focus in this chapter is on ESM studies, and not other daily methods such as the Elec- 
tronically Activated Recorder (Mehl & Robbins, Chapter 10, this volume) and the day reconstruc- 
tion method (e.g., Srivastava, Angelo, & Vallereux, 2008), we consider these similar methods to be 
useful for addressing these central conceptual questions in personality. 


CHAPTER 30 


Cross-Cultural Research 


At some Chinese restaurants in Los Angeles, one can find a peculiar item on the menu: 
“Chinese tamale.” Although it sounds like a Chinese adaptation of the Mexican 
tamale, the dish in question (zongzi) has been made in China for centuries. Noting the 
similarities in form between the two dishes, local Chinese restaurateurs latched onto 
tamale as the perfect translation for their traditional dish. Both dishes may consist of 
a meat filling, encased in a grain-based shell, wrapped in leaves, and cooked by steam- 
ing. However, the traditional tamale shell is made from corn that has been soaked and 
ground into flour, and then wrapped in corn husk or banana leaves; the zongzi shell is 
made from glutinous rice, and is often wrapped in bamboo leaves. If we look further into 
the meat filling, we see that both dishes may use similar types of meat (e.g., pork) but 
different spices and seasonings. The Chinese tamale provides a metaphor for how experi- 
ence sampling methodology (ESM) can contribute to our understanding of cross-cultural 
variation in psychological processes. By investigating the finer moments of everyday life 
in different cultures, researchers can peel back the wrappings of apparent cultural dif- 
ferences (and similarities) and refine their understanding of the nature and processes 
underlying these differences. 

How has ESM contributed to cross-cultural research? What are the unique advan- 
tages of using ESM to study cross-cultural questions? A major premise of cross-cultural 
research is that the universality of psychological processes cannot be presumed but must 
be evaluated across different cultural contexts. By assessing the daily experiences of 
people within a natural setting, ESM enhances the ecological validity of cross-cultural 
studies (see Reis, Chapter 1, this volume). In addition, ESM expands how cultural differ- 
ences and similarities are conceptualized. Early studies in cross-cultural psychology often 
relied on cross-sectional methodology, comparing two or more groups on their responses 
to survey questions or experimental situations. These studies tended to examine cultural 
differences and similarities in group means. Because ESM involves repeated measure- 
ments within and across days, the resulting dataset permits a wider range of phenomena 
to be examined. For example, there may be cultural differences in the extent to which 
momentary states covary with certain variables, as well as differences in the consistency 
of behavior across situations (intraindividual variation). Moreover, ESM can be 
combined with traditional surveys to enhance the types of mean-level comparisons that might 
be made between cultural groups. 


Chapter Summary and Overview 


We divide our review of cross-cultural applications of ESM into five main areas. First, 
we review studies that compare online (via ESM) and retrospective responses (via single- 
session surveys) and show that the two measures lead to different conclusions about 
cultural differences. Second, we review studies that highlight the distinction between 
quantity (i.e., how often certain events occur) and subjective quality (i.e., how events 
are experienced), and demonstrate that cultural differences may exist in either or both 
of these aspects. Third, we review studies that examine cultural differences in intra- 
psychic phenomena or within-person correlations (i.e., how psychological states covary 
with situational factors across cultures). These studies capture processes that may shift 
rapidly across contexts—such as the activation of different cultural identities and subse- 
quent emotions. Fourth, we discuss the potential of ESM data to quantify the amount of 
intraindividual variation across cultures, that is, how much people’s feelings and behav- 
iors vary overall from situation to situation—an issue that is distinct from mean-level 
and correlational studies. With each of the major applications, we discuss the unique 
advantages of using ESM. Fifth and last, we review the challenges associated with using 
ESM in different cultures and discuss directions for future research. 


Global versus Online Measures 
in Culture and Well-Being Research 


A major application of momentary assessments has been in studies of culture and subjec- 
tive well-being (SWB) that combine fine-grained experience sampling data with global 
measures of the same experience. These studies capitalize on the possibility that global 
or retrospective measures are imperfect reflections of online experience that may contain 
memory biases. By employing ESM, such biases are reduced. However, far from pointing 
to the conclusion that global measures of emotion are simply not as good as experience 
sampling measures, these studies have helped advance theoretical formulations of global 
well-being (see Kim-Prieto, Diener, Tamir, Scollon, & Diener, 2005; Schwarz, Chapter 
2, this volume). 

Self-reports can reference a variety of time frames from the most narrow (i.e., evalu- 
ations of the here and now, as captured by ESM) to the broadest (i.e., evaluations of one’s 
life as a whole, as captured in global self-reports). In other words, the distinction is one 
between momentary states and global traits, and self-reports fall in between state and 
trait measures. Different processes are evoked by narrow versus broad time frames, and 
discrepancies among the various measures have provided insight into cultural differences 
in SWB. 

When global measures of well-being are used in cross-cultural studies, Asian respon- 
dents often report lower life satisfaction and less positive emotions than Europeans and 
North Americans (for a review, see Tov & Diener, 2007). These differences concur with 
cultural norms regarding emotions; that is, Westerners tend to overwhelmingly favor 
positive affect (PA) over negative affect (NA), whereas Asians tend to ascribe value to 
both PA and NA (Eid & Diener, 2001). On the basis of global measures, it appears that 
cultural norms dictate which emotions are desirable to feel, and this in turn influences 
how people regulate their ongoing affective experience. However, studies that use ESM 
along with global reports reveal a more complex picture of how culture shapes well- 
being. For instance, Oishi (2002) compared Asian Americans and European Americans 
on self-reports of SWB using different time frames. He found that European American 
and Asian American students differed neither in their daily ratings of satisfaction (Study 
1) nor in their experience sampling reports of emotion (Study 2). However, when asked 
to evaluate retrospectively the same week, European American participants recalled the 
week as more satisfying than did Asian American participants, a finding that Wirtz, 
Chiu, Diener, and Oishi (2009) replicated. Similarly, Scollon, Diener, Oishi, and Biswas- 
Diener (2004) found greater cultural differences in retrospective reports of emotion than 
in online reports, particularly for NA. 

Cultural differences arise in global and retrospective measures but not in online 
measures, because the former involve reconstructive memory. As people’s memories for 
their experiences fade (which especially occurs when evaluating broad time frames, such 
as “last year” or “in general”), they come to rely less on the actual experience to inform 
their memory and more on heuristic information from sources such as cultural knowl- 
edge or beliefs to “fill in the gaps” of memory. In other words, people tend to form memo- 
ries that are consistent with their self-knowledge and their cultural values (see also Oishi 
et al., 2007), even if the memories are less than accurate. In the case of Oishi (2002) and 
Wirtz et al. (2009), if Asians and non-Asians held different views about the desirability 
of happiness, this may have led to group differences in the retrospective reports. How- 
ever, because cultural knowledge exerts a weaker influence on momentary self-reports, 
researchers observe few or no cultural differences in online reports of emotion. 

In a direct test of this conjecture, Scollon, Howard, Caldwell, and Ito (2009) had 
participants complete ESM and retrospective measures targeting the same week, as well 
as answer questions about ideal affect. Ideal affect is the extent to which people would 
ideally like to feel certain emotions, a measure that is strongly related to culture (Tsai, 
Knutson, & Fung, 2006). Ideal affect more strongly correlated with retrospective reports 
of emotion than with experience sampling reports of emotion, lending support to the idea 
that broader time frames of reporting are more strongly influenced by cultural knowledge 
than the short time frames captured by experience sampling. 

More generally, however, ESM may have an advantage in cross-cultural research 
over global measures because it is potentially less susceptible to reference group effects. 
The reference group effect occurs when different cultural groups respond to subjective 
Likert-type scales with different comparison groups in mind (Heine, Lehman, Peng, & 
Greenholtz, 2002). For example, Japanese people consider other Japanese people when 
answering self-reports, whereas Americans consider other Americans, particularly mem- 
bers of their same ethnicity. The confounding of different reference groups with culture 
undermines the validity of cross-national comparisons of group means and can lead to 
findings such as Japanese respondents scoring higher than Americans in individualism, 
and Americans scoring higher than Japanese in collectivism. ESM may reduce reference 
group effects in at least two ways. First, in experience sampling, respondents may com- 
pare their current states to previous states, so the referent will often be the respondents 
themselves. This is not to deny that social comparisons occur in ESM measures. 
However, unless social comparisons affect responses across all measurement occasions, 
then aggregated ESM measures should be less contaminated by reference group effects 
relative to global measures. Second, ESM can capture the occurrence of events and con- 
crete behaviors—variables for which social comparisons are less relevant. An excellent 
example of this comes from Ramírez-Esparza, Mehl, Álvarez-Bermúdez, and Pennebaker 
(2009), who measured sociable behaviors in everyday life in two cultures. Compared 
with Americans, Mexicans exhibited more sociable behaviors, such as talking with oth- 
ers, but in trait measures of extraversion, Mexicans scored lower. In this case, only the 
experience sampling measures concurred with folk beliefs about Mexican culture. 

Because global assessments tend to be more abstract and require the respondent 
to think in terms of counterfactuals, the concreteness of momentary assessments offers 
another advantage. For some respondents, considering and evaluating a state of affairs 
beyond the here and now is not only difficult but also downright unnatural. For example, 
one item in the Satisfaction With Life Scale (Diener, Emmons, Larsen, & Griffin, 1985) 
is “If I could live my life over again, I would change almost nothing.” Although to urban 
and Western respondents such an item may seem perfectly reasonable, one colleague of 
ours working in the villages of Vietnam reported that many elderly respondents did not 
know how to respond to the item because they could not grasp the hypothetical concept 
of living one’s life over again. 


Locating Subtle Cultural Differences: 
Quantity versus Quality of Experience 


Another advantage of ESM is that the repeated measurements permit two aspects of 
experiences to be operationalized simultaneously: their quantity (or frequency) and 
their subjective quality. The ability to assess both aspects can lead to a more sophisti- 
cated understanding of cultural differences. For example, suppose two cultures differ in 
their mean levels of stress. ESM might reveal that people in each culture spend different 
amounts of time at work, and work hours may mediate cultural differences in stress. 
Researchers might then investigate whether differences in values or economic conditions 
create greater demands to work in one culture versus the other. Alternatively, people in 
both cultures may work the same amount but experience work differently—in which case 
other variables may be relevant, such as the degree of power distance in each culture. The 
psychological responses that are evoked or reinforced in a particular situation can dif- 
fer across cultures. These cultural affordances (Kitayama, Duffy, & Uchida, 2007) can 
range from common social reactions (e.g., supervisors who launch tirades when mistakes 
are made) to public artifacts (e.g., advertisements that inspire career success). When cul- 
tural differences exist in quality but not quantity (or vice versa), researchers gain insight 
into the types of cultural affordances that may be operating. 

Although participants could simply estimate how often they experience certain types 
of events, such measures might misrepresent cultural differences because of the retrospec- 
tion involved. Furthermore, experience sampling may detect experiences that escape ordi- 
nary awareness. People often underestimate the frequency of both PA and NA (Thomas 
& Diener, 1990) because they discount the many times they experienced low levels of 
affect. Frequency estimates may also be biased by various factors such as the amount of 
time to answer the question or whether the question format is open or closed (Schwarz 
& Oyserman, 2001). By reducing the burden of retrospection, ESM minimizes the effects 
of these extraneous factors. 

In addition, people are occasionally inaccurate in recalling how they felt in certain 
situations. For example, parents often say that spending time with their children brings 
them the most happiness. In reality, taking care of one’s own children ranks as high in 
NA as commuting, and higher in NA than housework (Kahneman, Krueger, Schkade, 
Schwarz, & Stone, 2004)! Similarly, memories of vacations are often glowing (Wirtz, 
Kruger, Scollon, & Diener, 2003) despite frustrations of air travel, sunburn, and annoying 
travel companions. This underscores one of the major points of ESM: that to develop 
a more accurate account of people’s true experiences, we must capture them through 
momentary assessments. 

Several studies have examined cultural differences in both the quantity and quality 
of experiences. These studies generally use aggregated ESM data to make comparisons 
between group means. For example, the frequency of an experience might be summed for 
each participant, then averaged for each group. However, we will see that the quality of 
experience can be examined in ways other than mean comparisons. 
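
To make this aggregation concrete, the following is a minimal sketch in Python. It assumes hypothetical signal-level records; the participant IDs, group labels, and ratings are invented for illustration and are not data from any study cited here. Quantity is taken as the number of signals at which an experience was reported, and quality as the mean pleasantness rating on those occasions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ESM records: (participant_id, group, experience_reported, pleasantness 1-7)
records = [
    ("p1", "A", True, 6), ("p1", "A", False, 4), ("p1", "A", True, 5),
    ("p2", "A", True, 6), ("p2", "A", True, 5), ("p2", "A", False, 3),
    ("p3", "B", False, 4), ("p3", "B", True, 6), ("p3", "B", False, 3),
    ("p4", "B", False, 2), ("p4", "B", False, 4), ("p4", "B", True, 5),
]

def summarize(records):
    """Return {group: (mean frequency, mean quality)} built from per-person aggregates."""
    per_person = defaultdict(list)
    group_of = {}
    for pid, group, occurred, rating in records:
        per_person[pid].append((occurred, rating))
        group_of[pid] = group

    freq_by_group = defaultdict(list)
    qual_by_group = defaultdict(list)
    for pid, obs in per_person.items():
        ratings_when_present = [r for occurred, r in obs if occurred]
        # Quantity: sum the occurrences for each participant.
        freq_by_group[group_of[pid]].append(len(ratings_when_present))
        # Quality: undefined for a person who never reported the experience.
        if ratings_when_present:
            qual_by_group[group_of[pid]].append(mean(ratings_when_present))

    summary = {}
    for g in freq_by_group:
        quality = mean(qual_by_group[g]) if qual_by_group[g] else None
        summary[g] = (mean(freq_by_group[g]), quality)
    return summary

print(summarize(records))
```

With these made-up values, the two groups differ in frequency (2 vs. 1 reports per person) but not in mean pleasantness (5.5 in both), that is, a difference in quantity but not in quality.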


Differences in Quantity but Not Quality 


Using ESM, Scollon and colleagues (2004) found that Asian respondents reported far 
fewer instances of pride and more instances of guilt than non-Asian respondents. These 
cultural differences in frequency were consistent with cultural norms. Individual pride 
from accomplishing one’s goals or affirming some internal attribute can be “socially 
disengaging” because it emphasizes the separateness of the self from others (Kitayama, 
Mesquita, & Karasawa, 2006). In contrast, guilt is “socially engaging” because it moti- 
vates people to repair social bonds after a transgression. Accordingly, people with an 
interdependent orientation, such as Asians, consider pride less desirable and guilt more 
desirable than individuals with an independent orientation, and their momentary experi- 
ences reflect these values to some extent. However, all the cultural groups in Scollon et 
al.’s study showed similar factor loadings and factor structure of momentary emotions 
(i.e., pride covaried with PA, and guilt covaried with NA), suggesting that the quality of 
experience was similar for both groups. In other words, although Asians experienced 
pride less frequently and guilt more frequently than other groups, these emotions were 
not experienced as any less pleasant or unpleasant. 


Differences in Quality but Not Quantity 


Asakawa and Csikszentmihalyi (1998) assessed the daily activities of Asian and Euro- 
pean American students. Although no ethnic differences emerged in the proportion of 
time spent on academic activities, Asian Americans reported experiencing these activi- 
ties more positively. This might lead to the question of whether Asian American families 
structure their homes in a way that is more conducive to studying, and it illustrates the 
potential of ESM to test and further advance theories about cultural differences. 

Similarly, using event-contingent recording to assess social anxiety experiences, Lee, 
Okazaki, and Yoo (2006) found that Asian Americans and European Americans reported 
similar numbers of anxiety-provoking events, but that Asian Americans reported more 
intense NA in reaction to these events. Lee et al. suggested that Asian cultural norms that 
discourage the expression of NA may reduce the frequency but enhance the intensity of 
NA. 


Differences in Both Quantity and Quality 


Using event-contingent sampling of social interactions, Wheeler, Reis, and Bond (1989) 
found that Hong Kong students had longer social interactions than American students. 
However, the Hong Kong students had fewer social interactions overall, and the interac- 
tions involved fewer others compared to American students—a difference of both quan- 
tity and quality that is consistent with collectivism. 

Lee and Larson (2000) found that Korean high school students not only spent more 
time studying than European American students but also reported greater NA in response 
to studying (cf. Asakawa & Csikszentmihalyi, 1998). Korean students face an extremely 
competitive college admissions process (only 25% of applicants receive admission), which 
may foster longer hours of studying and more stress. The authors also found that NA 
mediated the cultural difference in overall depression reported by the students. Korean 
students experienced stronger NA than European American students when studying, and 
this difference accounted for the greater depression reported by the former group. 

Grzywacz, Almeida, Neupert, and Ettner (2004) compared adults of different levels 
of education in a 1-week daily diary study. We include educational attainment in our dis- 
cussion of “cross-cultural” research because previous research has shown that socioeco- 
nomic status may influence cultural models of self and agency (Snibbe & Markus, 2005). 
Grzywacz and colleagues (2004) found that better-educated adults reported a higher 
frequency of daily stressors than less-educated adults. However, the stressors reported 
by less-educated adults tended to be more severe in terms of both subjective ratings and 
objective ratings made by trained coders. In other words, people of different educational 
status may both encounter different stressors in their daily life and adapt to some of the 
same stressors in different ways. 

In summary, these studies highlight the distinction between quantity and quality of 
experience that ESM designs help to reveal, and their potential for furthering investiga- 
tions in cross-cultural research. 


Studies of Intrapsychic Phenomena 


Most of the cross-cultural studies reviewed in the previous sections have used momentary 
assessments in aggregated form: The repeated measures for each individual are aver- 
aged, and comparisons are made on group means. However, ESM data can also yield 
important insights into the momentary processes operating within the individual (see 
Hamaker, Chapter 3, this volume). The researcher can examine how a person’s thoughts, 
feelings, and behaviors covary with each other, as well as with specific types of situa- 
tions. We discuss two applications of ESM to study intrapsychic phenomena in cultural 
psychology. First, several researchers have examined how the within-person correlates of 
emotional well-being differ across cultures. Second, ESM has been used to examine situ- 
ational fluctuations in ethnic identity and its emotional correlates. Though not strictly 
cross-cultural, such studies are important, because they enhance the external validity of 
cultural priming theories. 


Culture as Moderator 


ESM allows researchers to identify relations among situationally variable constructs. 
Culture adds another layer of complexity to the design by allowing investigators to see 
whether the within-person relationships vary according to group membership. Thus, 
these studies reveal the complex ways in which cultural groups are both similar (e.g., 
in the direction of the within-person correlations) and different (e.g., in the strength 
of the within-person associations). For example, Kitayama and colleagues (2006) used 
daily diary methodology to examine the extent to which feelings of engagement and 
disengagement were associated with feelings of happiness within individuals. Although 
engaging positive emotions, such as friendly feelings, were associated with greater happiness 
at the momentary level for all participants, the effect was stronger for Japanese than 
for American respondents. Likewise, disengaging positive emotions, such as pride, were 
generally associated with greater happiness, but the effect was stronger for Americans 
than for Japanese. In a similar paradigm, Nezlek, Kafetsios, and Smith (2008) found 
that the relation between self-construal and emotions can differ by culture. Independent 
self-construal was positively associated with PA among British participants but negatively 
associated with PA among Greek participants. 

Moneta (2004) examined flow among American and Chinese students and found 
that culture moderated the construction of flow states. Whereas flow theory states that 
flow is achieved when situational challenges and skills are both high, Chinese students’ 
flow states were characterized by greater skills than challenges. Sorrentino and colleagues 
(2008) investigated the impact of person-environment fit on flow and emotions. Specifi- 
cally, when people’s style of coping with uncertainty matched that of their country, they 
tended to experience more active emotions, such as flow, as well as more PA and less NA 
in general. 

Several studies have examined whether collectivists are sensitive to social context. 
Oishi, Diener, Scollon, and Biswas-Diener (2004), for instance, found that the presence 
of friends was associated with greater momentary PA for all cultural groups in their study, 
but that this effect was stronger for Asian samples than for non-Asian respondents. 
Likewise, the presence of strangers was associated with greater momentary NA, but 
again only for Asian respondents. Similarly, Nezlek, Sorrentino, et al. (2008) found that 
the self-esteem of Japanese people was more reactive to daily events than that of North 
Americans. 

Scollon, Diener, Oishi, and Biswas-Diener (2005) found that culture moderated 
the relation between PA and NA, depending on the level of analysis. Specifically, at the 
momentary or within-person level, PA and NA were negatively correlated for both Asians 
and non-Asians (albeit somewhat less negatively among Asians). However, at the between- 
person level, when emotion ratings for each person were aggregated across moments, PA 
and NA were positively correlated in Asian samples and independent in non-Asian sam- 
ples. It is important to note that the between-person findings were based on aggregated 
ESM data, and not global reports of emotion; thus, the cultural differences cannot be due 
to implicit beliefs about emotions. 
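
The distinction between the two levels of analysis can be illustrated with a short Python sketch. The data and function names below are hypothetical, and the simple Pearson correlations stand in for the multilevel models that published analyses would more likely use. The within-person correlation is computed on deviations from each person's own mean; the between-person correlation is computed on the person means.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation for two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical data: person -> list of (PA, NA) momentary ratings
data = {
    "p1": [(5, 2), (3, 4), (4, 3)],
    "p2": [(6, 3), (4, 5), (5, 4)],
    "p3": [(7, 4), (5, 6), (6, 5)],
}

def within_between(data):
    """Return (within-person r, between-person r) for PA and NA."""
    pa_dev, na_dev, pa_means, na_means = [], [], [], []
    for obs in data.values():
        pa_m = mean([pa for pa, _ in obs])
        na_m = mean([na for _, na in obs])
        pa_means.append(pa_m)
        na_means.append(na_m)
        # Person-mean centering removes stable between-person differences.
        pa_dev.extend(pa - pa_m for pa, _ in obs)
        na_dev.extend(na - na_m for _, na in obs)
    return pearson(pa_dev, na_dev), pearson(pa_means, na_means)

within_r, between_r = within_between(data)
print(within_r, between_r)
```

In this contrived example the within-person correlation is negative while the between-person correlation is positive, which shows only that the two levels are statistically separable; it does not reproduce the reported findings.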


Momentary Fluctuations in Ethnic Identity 


Cultural priming experiments have shown that people possess cultural knowledge that, 
when temporarily activated, guides perceptions and interpretations of the environment. 
Bicultural individuals, in particular, can rapidly switch cultural frames (Hong, Morris, 
Chiu, & Benet-Martinez, 2000). ESM has added to our understanding of cultural frame 
switching by showing that it occurs naturally and spontaneously, and by identifying 
moderators of this phenomenon. In essence, ESM takes the classic priming laboratory 
experiment and places it in its natural setting, where language and the presence of family 
members can serve as natural cultural primes. 

For example, Yip (2005) found that the presence of the Chinese language and fam- 
ily members activated momentary Chinese identity (e.g., “How Chinese am I at the 
moment?”) among Chinese Americans. These natural priming effects were stronger for 
those with a greater overall Chinese identity (e.g., “How Chinese do I regard myself 
on average?”). Momentary Chinese identity was also associated with greater situational 
well-being, and this effect was stronger among people for whom the Chinese identity was 
central and regarded more positively. 

Moreover, Chinese American students may experience both identities simultaneously 
(Yip, 2009). However, such experiences depend on both the situation and participants’ 
overall identification with Chinese and American culture. Students with a low American 
identity tended to feel both Chinese and American in the presence of classmates. In con- 
trast, students with low Chinese identity tended to feel both identities in the presence of 
their family. Thus, the conditions that led to the simultaneous activation differed across 
students. 

Finally, Perunovic, Heller, and Rafaeli (2007) showed that language can evoke 
culture-consistent psychological responses. East Asian Canadian students in their study 
reported how they felt over the past 2 hours, as well as the language they primarily spoke 
during the same period. When students spoke English, state NA was negatively correlated 
with state PA. However, when an East Asian language was spoken, NA and PA were less 
inversely correlated (see also Scollon et al., 2005). 


Cultural Differences in Intraindividual Variation 


Another possible application of momentary assessment data is examining cultural dif- 
ferences in within-person variation in feelings and behavior. For instance, Oishi et al. 
(2004) examined the within-person variability in emotional intensity across cultures. 
They predicted that the social context would have a greater effect on emotional experi- 
ence for people living in an interdependent culture. Specifically, Japanese participants’ 
emotional experiences should fluctuate more across social situations than those of Euro- 
pean Americans. Consistent with predictions, within-person standard deviations were 
larger for Japanese than for European Americans. 
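
As a rough sketch of how such an index might be computed, the Python snippet below takes each participant's standard deviation across momentary ratings and then averages these within each group. The group labels and ratings are hypothetical and chosen only to show the computation; they are not values from Oishi et al. (2004).

```python
from statistics import mean, stdev

# Hypothetical data: person -> (group label, momentary emotion intensity ratings)
ratings = {
    "a1": ("group_1", [2, 6, 2, 6]),
    "a2": ("group_1", [1, 5, 3, 7]),
    "b1": ("group_2", [4, 4, 5, 3]),
    "b2": ("group_2", [3, 4, 4, 3]),
}

def mean_within_person_sd(ratings):
    """Average the per-person standard deviations within each group."""
    sds_by_group = {}
    for group, values in ratings.values():
        # stdev() is the sample standard deviation of one person's ratings.
        sds_by_group.setdefault(group, []).append(stdev(values))
    return {group: mean(sds) for group, sds in sds_by_group.items()}

print(mean_within_person_sd(ratings))
```

A larger average within-person standard deviation in one group would indicate that its members' ratings fluctuate more from moment to moment. Whether the fluctuation tracks the social context would require additional situational variables that this sketch does not model.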

To date, applications of this approach have been rare, although such data may be 
relevant for testing theories in cross-cultural psychology. For example, the construct 
of tightness–looseness (Gelfand, Nishii, & Raver, 2006; Triandis, 1995) refers to the 
strength of social norms in a society and the extent to which deviations from norms are 
sanctioned. In culturally tight societies (e.g., Saudi Arabia), norms are enforced more 
stringently than in culturally loose societies (e.g., New Zealand). Consequently, some 
theories of tightness–looseness (e.g., Gelfand et al., 2006) posit greater conformity and 
less between-person variability in tight cultures (vs. loose cultures). However, such theo- 
ries could be further developed by considering within-person variability. For example, 
do people in tight cultures behave more consistently when they are in public settings 
(where norms are more easily enforced) than when they are in private settings? That is, 
one would expect less within-person variability in public settings. ESM could provide 
more precise measurements of both intra- and interindividual variability. In contrast, 
retrospective self-reports of behavioral variability might be biased by cultural norms that 
emphasize following rules and protocol. 


Challenges of Cross-Cultural Applications of ESM 


Conducting ESM across cultural contexts may present unique challenges to the researcher. 
We discuss three of these: participant issues, deciding whether to collect data via paper or 
electronic devices, and measurement equivalence. 


PARTICIPANT ISSUES 


Although technology can make administering ambulatory assessments more convenient, 
ultimately, getting the data, especially high-quality data, requires considerable coopera- 
tion and effort on the part of participants. Certainly there are individual differences in 
the ability to comply with experience sampling protocols. In general, participants who 
are able to complete an ESM study and provide sufficient data are more motivated and 
conscientious than those who drop out, ignore signals, or forget their ambulatory device. 
Given that there are cultural differences in conscientiousness (McCrae, Terracciano, & 
79 Members of the Personality Profiles of Cultures Project, 2005) and compliance, these 
participant issues could affect the reliability of the data for one group relative to the 
others. Although no ESM study has examined this issue directly, Yip (2005) noted that 
compliance was higher in her study of Asian Americans compared to most ESM studies. 
However, individual and group differences need not pose a major problem for experience 
sampling cross-cultural research if researchers take the time to explain to participants the 
importance of the study and of responding thoughtfully. To increase participant compli- 
ance, Conner Christensen, Feldman Barrett, Bliss-Moreau, Lebo, and Kaschub (2003) 
recommend that investigators occasionally phone or e-mail participants to remind them 
of their ongoing participation. 

Convenience and compliance aside, there may also be cultural differences that make 
some groups more reactive in general than others (see Barta, Tennen, & Litt, Chapter 
7, this volume). For example, it is possible that individuals or groups with high social 
anxiety (Okazaki, 1997) may react more negatively to being signaled in the presence of 
others. Even if this is the case, however, problems could be circumvented by using silent 
devices that minimize drawing any attention to the participant when signaled. 


PAPER VERSUS ELECTRONIC DATA COLLECTION 


Electronic devices have a range of advantages: They facilitate data collection, entries can 
be accurately time stamped, and participants can be required to complete their entries 
within a certain window of time (to prevent backfilling). However, in a cross-cultural con- 
text, the researcher must consider whether all groups are equally familiar and comfort- 
able with using such devices. Oishi and colleagues (2004) employed handheld computers 
for their American and Japanese participants but used paper forms and preprogrammed 
wristwatches for their Indian participants because portable electronic devices were not 
as common in the region at the time. This confounding of method and cultural group 
may be undesirable given the debate regarding the quality of data collected via paper 
(Broderick & Stone, 2006; Green, Rafaeli, Bolger, Shrout, & Reis, 2006). On the other 
hand, using a novel high-tech device could increase reactivity in participants who are less 
experienced with technology. One solution may be to collect data via familiar devices 
such as participants’ own mobile phones (e.g., Song, Foo, & Uy, 2008)—an increasingly 
viable option as mobile phone usage grows in the developing world. 
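
The time-stamping advantage mentioned above can be sketched as a simple compliance check. The Python below is illustrative only: the function names and the 15-minute response window are assumptions, not features of any particular ESM software. An entry counts as compliant if it was completed after the signal and within the window, which is how electronic time stamps allow backfilled entries to be detected.

```python
from datetime import datetime, timedelta

def is_within_window(signal_time, response_time, window_minutes=15):
    """True if the entry was completed after the signal and within the window."""
    elapsed = response_time - signal_time
    return timedelta(0) <= elapsed <= timedelta(minutes=window_minutes)

def compliance_rate(pairs, window_minutes=15):
    """Share of signals answered on time; a missed signal has response None."""
    on_time = sum(
        1
        for signal, response in pairs
        if response is not None
        and is_within_window(signal, response, window_minutes)
    )
    return on_time / len(pairs)

# Hypothetical signal/response time stamps for one participant on one day
base = datetime(2024, 1, 1, 9, 0)
pairs = [
    (base, base + timedelta(minutes=3)),                                  # on time
    (base + timedelta(hours=2), base + timedelta(hours=2, minutes=40)),   # too late
    (base + timedelta(hours=4), None),                                    # missed
    (base + timedelta(hours=6), base + timedelta(hours=6, minutes=10)),   # on time
]

print(compliance_rate(pairs))
```

Computing this rate separately for each cultural group is one way to check whether compliance, and hence data quality, differs across groups.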

Other factors might make paper diaries preferable. For example, if research is con- 
ducted in areas of high crime, electronic devices may be stolen and the data lost altogether 
(Tennen, Affleck, Coyne, Larsen, & DeLongis, 2006). Paper diaries could be used for all 
groups, but this would magnify the amount of data entry required. Ultimately, the goal 
should be to collect the highest quality data possible. The decision to use paper or elec- 
tronic methods needs to be assessed by each researcher, and for each cultural group being 
studied. Either way, researchers should do their best to ensure compliance. For example, 
paper entries could be mailed in and time stamped by the post office (Tennen et al., 2006), 
though such procedures are less practical if the experience sampling frequency is high. 


MEASUREMENT EQUIVALENCE 


Another issue in cross-cultural research is ensuring that the measures used have equivalent 
meaning across all groups studied. This issue applies to cross-cultural research in general 
rather than to ESM in particular and can be dealt with on two fronts. First, researchers 
can minimize interpretational ambiguities from the start by making clear to participants 
what the items mean. For example, Oishi and colleagues (2004) gave participants explicit 
details on the meaning of being alone (“wherever you are, there are no other people pres- 
ent, including strangers”), because some people (particularly collectivists) might consider 
themselves alone when in the presence of strangers. Item interpretation and many other 
problems can be reduced with a training session in which researchers walk through an 
entire experience sampling form with participants in the laboratory before beginning 
the study (Feldman Barrett’s Experience Sampling Program [ESP] conveniently has a 
training mode feature in which no data are recorded). We urge our participants to ask 
any clarifying questions during the training, so that no ambiguities remain when they 
leave the laboratory and begin the study. 

Second, researchers can establish measurement equivalence across groups by exam- 
ining factor structures. Multigroup means and covariance structure analysis could be 
applied in some instances where single-item ESM data are aggregated (e.g., average daily 
joy) as indicators of between-person constructs (e.g., overall PA). Ideally, the same items 
should load onto the same factors across cultural groups. If factor loadings are uniformly 
higher in one group than in another, spurious group differences might be produced (Chen, 
2008). For example, if some emotion words are better indicators of PA in the United States 
than in China, then average PA could be underestimated in the latter. Notably, Scollon 
et al. (2004) computed average PA by omitting the specific emotion pride for this reason. 
However, the issue of measurement equivalence becomes more complex as the number of 
groups increases, as well as when the researcher wants to evaluate whether constructs are 
comparable across levels (e.g., do state and trait PA have the same structure?). Multilevel 
structural equation models have been applied to cross-sectional data in which people are 
nested in cultures or groups (Mehta & Neale, 2005; Selig, Card, & Little, 2008), but the 
application of these methods to cross-cultural ESM data (in which repeated measures are 
nested within persons in different groups) is not common at the present time. 


Future Directions 


As researchers become more familiar with the benefits of ESM, we expect to see a greater 
number of cross-cultural studies that make use of these methods. Future applications 
of ESM should co-evolve with theoretical developments in cross-cultural psychology. 
Increasing use of momentary assessments can aid in theory building by helping research- 
ers refine how they think culture influences psychological processes (e.g., quantity vs. 
quality) and expanding the type of differences that can be considered. One possible appli- 
cation of momentary assessments that we have not discussed is examining how cultures 
may differ in terms of daily or weekly cycles of thoughts, feelings, and behaviors (for a 
monocultural example, see Hasler, Mehl, Bootzin, & Vazire, 2008). Such analyses could 
shed light on how everyday life and routines are structured in one society versus another. 
Alternative methods such as the day reconstruction method (DRM; Kahneman et al., 
2004) could also be employed. An open question is whether DRM and ESM methods 
lead to the same conclusions when both are used in cross-cultural research. 

Another intriguing direction is to combine experimental manipulations with momen- 
tary assessments, as illustrated by aan het Rot, Moskowitz, Pinard, and Young (2006). 
Participants who were randomly assigned to take tryptophan reported less quarrelsome 
behaviors over a 15-day period compared with a placebo group. Experimental manipula- 
tions could be administered before an ESM portion, and groups could be compared to 
see how long the effects last in one group versus another. Alternatively, the manipulation 
could be part of the momentary assessments, such as focusing on how versus why an 
event happens (Strack, Schwarz, & Gschneidinger, 1985), and how this affects judgments 
of well-being in different cultural groups. 

With an expansion in the types of phenomena that can be examined cross-culturally 
comes a greater responsibility for researchers to formulate theories that go beyond testing 
for cultural differences to identifying the variables that mediate these differences (Mat- 
sumoto & Yoo, 2006). It is one thing to observe that Asian Americans are more likely 
than European Americans to remember personal events that made their parents happy, 
and yet another to be able to attribute these differences to the greater importance that 
Asian American students place on parental approval (Oishi et al., 2007). The latter find- 
ing contributes to the development of a theory about how values influence our memory 
for events. Thus, an important future direction is not only to locate group differences in 
momentary experiences but also to account for them with theoretically relevant media- 
tors. Here, too, momentary assessments can be informative because differences in daily 
experiences (e.g., time spent with family, chronic accessibility of achievement goals) may 
underlie many cultural group differences. Hence, ESM can be used to measure the media- 
tors of cultural differences (a nice example of this is the Lee and Larson [2000] study 
discussed earlier). 

We want to emphasize, however, that ESM may not be appropriate for all research 
questions and, in some instances, global or integrative assessments may be superior, 
depending on the research objectives. Although we have argued that ESM helps to reduce 
the biases associated with retrospection, there are cases when global measures predict 
outcomes better than momentary measures, such as when trying to understand people’s 
choices (Wirtz et al., 2003). While moment-to-moment assessments can capture the fine 
details of rapidly fluctuating states with high precision and accuracy, autobiographical 
measures—despite their inaccuracies—offer insights into how people integrate and find 
meaning in their experiences. Ultimately, the costs (in terms of time and complexity) need 
to be weighed against the benefits. 


Conclusion 


Not only does ESM offer a wealth of advantages to research in general but it also is an 
especially powerful tool for cross-cultural research for several reasons. First, because cul- 
ture and reconstructive memory are intimately entwined, global and retrospective mea- 
sures do not always produce accurate conclusions about cross-cultural differences. Sec- 
ond, ESM allows researchers to capture cultural differences in the quantity and quality 
of experiences. Third, ESM allows researchers to examine intraindividual phenomena, 
including processes such as ethnic identity and cross-situational consistency. Most impor- 
tant, ESM has transformed how we conceptualize cross-cultural questions. A field that 
once addressed primarily mean-level differences and similarities can now ask sophisti- 
cated and multilayered questions regarding covariance structure and dispersion. In short, 
if the major premise of cross-cultural research is to understand whether psychological 
processes are universal or culture specific, then ESM provides the fine-grained resolution 
to view the texture of psychological phenomena as they operate within a particular indi- 
vidual in a particular situation and culture at a particular time. 


CHAPTER 31


Positive Psychology 


JAIME L. KURTZ 
SONJA LYUBOMIRSKY 


In its relatively brief history, positive psychology has used a variety of research
methodologies to better understand the nature of happiness and the “good life.” In the face of
scant empirical knowledge of positive emotions (cf. Diener, Suh, Lucas, & Smith, 1999), 
tightly controlled randomized laboratory studies have begun to establish the causes, cor- 
relates, consequences, and basic mechanisms underlying positive emotions and global 
happiness. Such experiments represent a significant advance over the correlational survey 
studies that informed researchers’ understanding of well-being for decades. In the pur- 
suit of triangulation, an important next step is to examine the real-world circumstances 
under which people feel good and thrive. According to Seligman and Csikszentmihalyi 
(2000), “Psychology should be able to help document what kinds of families result in 
children who flourish, what work settings support the greatest satisfaction among work- 
ers, what policies result in the strongest civic engagement, and how people’s lives can 
be most worth living” (p. 5). Realizing these goals requires the study of positive experi- 
ences and positive emotions in everyday life and in real time. Not coincidentally, perhaps, 
experience sampling methods, which are used to assess aspects of daily life as they occur 
in the moment, have their origins in positive psychology (Csikszentmihalyi, Larson, & 
Prescott, 1977). 

Although a variety of positive experiences, traits, and states—joy, anticipation, opti- 
mism, compassion, curiosity, inspiration, gratitude, awe—have been the focus of research 
in positive psychology, the majority of studies have examined happiness, or subjective 
well-being. Consequently, this chapter focuses on methodological advancements in the 
area of well-being. However, as we argue later, such methods are ripe to be extended and 
applied to other positive psychological constructs as well. 


Global Reports of Subjective Well-Being 


Happiness—or subjective well-being (SWB)—is commonly described by researchers as 
comprising both a “hot” affective component and a “cold” cognitive component (Diener 
et al., 1999). First, happy people typically experience frequent positive emotions and
relatively infrequent negative emotions. Second, when “coldly” evaluating their lives, happy
people report high levels of satisfaction with life as a whole and with specific domains, 
such as work, health, and interpersonal relationships. Following the positive psychologi- 
cal literature, we use the terms subjective happiness and subjective well-being synony- 
mously in this chapter. These constructs are labeled “subjective” because the evaluation 
is drawn from the individual’s own report, and the individual is assumed to be in the best 
position to make it. In Myers and Diener’s (1995) words, the best judge of happiness is 
“whoever lives inside a person’s skin” (p. 11). 


Commonly Used Global Measures 


In keeping with this definition of subjective well-being, the majority of positive psycho- 
logical research uses broad, global self-reports to assess a person’s happiness. For exam- 
ple, for many years, single-item measures such as “Taking all things together, how would 
you say things are these days?” (Campbell, Converse, & Rodgers, 1976) were common. 
Such single-item measures are still used, particularly when questions about SWB are 
included in much longer and broader surveys, such as a nationally representative census 
(e.g., Lucas, Clark, Georgellis, & Diener, 2004). Given their brevity, these measures are 
surprisingly reliable. For example, Fujita and Diener (2005) found a test-retest reliability 
of .55 over a 17-year period. 

Despite these strengths, slightly longer scales are used more frequently. The Subjec- 
tive Happiness Scale (Lyubomirsky & Lepper, 1999), a 4-item measure of happiness, 
includes the prompt, “In general, I consider myself ... ,” with responses ranging from 1 
(Not a very happy person) to 7 (A very happy person). It also asks, “Compared to my 
peers, I consider myself...” (1 = Less happy, 7 = More happy), “Some people are gener- 
ally very happy. They enjoy life regardless of what is going on, getting the most out of 
everything. To what extent does this characterization describe you?” (1 = Not at all, 7 
= A great deal), and “Some people are generally not very happy. Although they are not 
depressed, they never seem as happy as they might be. To what extent does this charac- 
terization describe you?” (1 = Not at all, 7 = A great deal; reverse-scored). 
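As a rough illustration of how responses to the four items above might be combined, the sketch below reverse-scores the fourth item and averages the result. The function name `score_shs` and the averaging step are assumptions made for illustration; the description above specifies only the 1 to 7 response range and the reverse-scored final item.

```python
def score_shs(responses):
    """Combine four 1-7 item responses into a single composite score.

    Assumption for illustration: the composite is the mean of the four
    items after the fourth item has been reverse-scored.
    """
    if len(responses) != 4:
        raise ValueError("expected exactly four item responses")
    if any(not 1 <= r <= 7 for r in responses):
        raise ValueError("each response must lie between 1 and 7")
    items = list(responses)
    # Reverse-score item 4 on a 1-7 scale: 1 becomes 7, 2 becomes 6, and so on.
    items[3] = 8 - items[3]
    return sum(items) / len(items)


# Hypothetical respondent: high on items 1-3, low on the "not very happy" item.
print(score_shs([6, 5, 6, 2]))  # (6 + 5 + 6 + 6) / 4
```

Reverse-scoring before averaging keeps all four items pointing in the same direction, so that a higher composite always indicates greater reported happiness.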

Probably the most commonly used scale, the Satisfaction with Life Scale (Diener, 
Emmons, Larsen, & Griffin, 1985), assesses the cognitive component of SWB with five
broad items, each answered on a 7-point Likert-type scale ranging from 1 (Strongly
disagree) to 7 (Strongly agree): “In most ways my life is close to my ideal,” “The conditions of my life
are excellent,” “I am satisfied with life,” “So far I have gotten the most important things 
I want in life,” and “If I could live my life over again, I would change nothing.” 

A growing number of cross-sectional and longitudinal studies have used these broad 
SWB measures to establish that global happiness relates to a number of desirable out- 
comes (see Lyubomirsky, King, & Diener, 2005, for a review). For instance, in the realm 
of work, happy people—that is, those who score highly on SWB measures—are more 
likely to graduate from college (Frisch et al., 2004), to be favorably evaluated by work 
supervisors (Staw, Sutton, & Pelled, 1994), and to make more money (Lucas et al., 2004). 
Dispositionally happy people also have better social relationships. They are prosocial 
and enjoy helping others (Krueger, Hicks, & McGue, 2001). They have a large network 
of friends and are relatively more likely to be in an enduring, committed, and satisfy- 
ing romantic relationship (Diener & Seligman, 2002). Happy people are also less likely 
to suffer from psychopathologies such as depression and anxiety (Diener & Seligman,
2002). Perhaps most strikingly, happy people experience superior health and longevity 
(Danner, Snowdon, & Friesen, 2001). 

In summary, the literature cited earlier exemplifies some of the many studies that 
have demonstrated meaningful links between important life outcomes and happiness, 
as operationalized by global measures of SWB, thus providing evidence regarding their 
validity. Furthermore, these measures have been found to be psychometrically sound 
and to show high levels of reliability (Fujita & Diener, 2005; Lucas & Donnellan, 2007). 
Finally, commonly used measures of SWB are efficient and cost-effective. 


Problems with Global Well-Being Measures 


Although global SWB measures possess many strengths, they are also characterized by 
several notable shortcomings. Consider items such as “The conditions of my life are 
excellent” and “I am satisfied with life” (Diener et al., 1985). What information does 
a respondent use when making these broad assessments? Global scales of SWB require 
participants to reflect on their lives, accurately recall and appropriately weigh many dis- 
crete life episodes, then offer a reasonably unbiased assessment of their overall happi- 
ness. According to Schwarz and Strack’s (1999) judgment model of SWB, there are many 
reasons why this may be difficult for participants. First, one’s momentary mood can 
serve as mental shorthand when making judgments about the broad quality of one’s life 
(e.g., “I feel cheerful right now, so my life must be going well”; Schwarz & Clore, 1983), 
although some argue that this effect is smaller than originally thought (Eid & Diener, 
2004). Related to this point, SWB assessments can be influenced by immediate but very 
minor circumstances, such as finding a dime before being asked how generally happy 
one is (Schwarz & Strack, 1999). Global ratings are also highly sensitive to whatever 
information is made accessible prior to providing them. For example, Strack, Martin, 
and Schwarz (1988) found no correlation between life satisfaction and dating satisfac- 
tion when a question about life satisfaction preceded a question about dating status (e.g., 
single vs. in a relationship). However, when the question order was reversed, a significant 
positive correlation emerged. When respondents’ dating status was made accessible, they 
appeared to use it as a heuristic (or shortcut) to help judge their overall life satisfaction. 
Finally, social comparisons—even relatively arbitrary ones—can push global SWB rat- 
ings in one direction or another, depending on whether the comparison is favorable or 
unfavorable. For example, encountering an experimenter in a wheelchair creates a down- 
ward social comparison that leads to inflated SWB ratings (Strack, Schwarz, Chassein, 
Kern, & Wagner, 1990). 

Taken together, these biases highlight that global ratings of SWB do not always 
represent well-integrated, unbiased assessments of the relevant moments in a person’s 
life. Instead, well-being judgments appear to be based on a truncated search that is sen- 
sitive to temporally accessible information. Indeed, in many cases, researchers do not 
know what information is being used when a specific participant is answering very broad 
questions about his or her SWB, rendering the results open to interpretation. Of course, 
random assignment to condition can minimize this problem in experimental paradigms, 
thus converting it into a source of random error that is evenly dispersed across condi- 
tions. In correlational or survey data, however, concerns with context-dependent effects 
are arguably more critical. Despite this situation, global measures of SWB such as the 
Subjective Happiness Scale (Lyubomirsky & Lepper, 1999) and the Satisfaction with
Life Scale (Diener et al., 1985) are remarkably temporally stable and predictive of a 
variety of important positive outcomes, as delineated earlier. Such cost-effective and self- 
administered self-report measures will undoubtedly continue to be frequently used in 
SWB research. However, the problems inherent in retrospective global reports of happi- 
ness have led some researchers (e.g., Krueger, Kahneman, Schkade, Schwarz, & Stone, 
2009) to propose alternative methodologies to allow for relatively less biased real-time 
happiness assessments. 


The Experience Sampling Method 


Measures of global SWB clearly tap into the cognitive component of happiness—that is, 
the top-down, global evaluation of one’s life (Diener et al., 1999). By contrast, the expe- 
rience sampling method (ESM), sometimes referred to as ecological momentary assess- 
ment, is more suitable for assessing the affective component—namely, frequent positive 
affect and infrequent negative affect. ESM is based on a bottom-up conceptualization 
of happiness. From this perspective, happiness is the aggregate of affective experiences 
encountered throughout daily life. 

With its roots in positive psychology, the ESM actually came out of Csikszentmi- 
halyi’s (1990) work on flow, a positive state of intense focus and engagement with a 
challenging activity. Because Csikszentmihalyi was interested in knowing when people 
experience flow in their everyday lives, outside of the laboratory, he needed an ecologi- 
cally valid methodology that allowed for sampling participants throughout the course of 
the day—while they were at school, work, or leisure—as an alternative to retrospective 
reporting methods, such as interviews and questionnaires. 

In the mid-1970s, Csikszentmihalyi and colleagues were among the first to adopt 
ESM in their work on adolescents (see P. Wilhelm, Perrez, and Pawlik, Chapter 4, 
this volume). Participants were given electronic pagers, which they carried everywhere 
throughout the day. The pagers were programmed to signal the participants randomly, 
with each signal serving as a prompt to report their thoughts and feelings immediately,
using a booklet of paper-and-pencil self-report forms. Common open-ended questions 
on these forms were “As you were beeped, what were you thinking about?”; “Where 
were you?”; “What was the main thing you were doing?”; and “What other things were 
you doing?” Using Likert scales, participants also rated the extent to which they were 
concentrating, in control of the situation, feeling good, and living up to their expecta- 
tions—all hallmarks of the flow state (Csikszentmihalyi, Larson, & Prescott, 1977). 
Aggregated over the course of days and weeks, these data allowed researchers to opera- 
tionalize the flow state more clearly, examine the types of activities and mental states 
that are conducive to flow, and correlate frequency of flow with characteristics of the 
person (Csikszentmihalyi & Larson, 1987). 
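The random signaling procedure described above can be sketched in a few lines. This is a minimal illustration, not the original pager program: the waking window (8:00 to 22:00), the eight signals per day, and the choice to draw one random time within each equal block of the day are all assumptions, as are the names `signal_schedule` and `as_clock`.

```python
import random


def signal_schedule(n_signals=8, start_min=8 * 60, end_min=22 * 60, seed=None):
    """Return signal times (minutes after midnight), one per equal time block.

    Assumption for illustration: the waking window is split into n_signals
    equal blocks and one time is drawn uniformly at random inside each block,
    so signals are unpredictable but spread across the whole day.
    """
    if n_signals < 1 or end_min <= start_min:
        raise ValueError("need at least one signal and a positive time window")
    rng = random.Random(seed)
    block = (end_min - start_min) / n_signals
    times = []
    for i in range(n_signals):
        block_start = start_min + i * block
        times.append(int(rng.uniform(block_start, block_start + block)))
    return times


def as_clock(minutes):
    """Format minutes after midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


print([as_clock(t) for t in signal_schedule(seed=1)])
```

Drawing one time per block is only one way to randomize; fully independent random draws across the day are equally plausible but can cluster several signals close together.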

The new ESM methodology also allowed researchers to understand more fully just 
how people spend their days. For example, by examining the time periods during which 
participants were engaging in particular activities, Csikszentmihalyi and his colleagues 
were able to estimate how much time people typically spent at work, at school, and at 
leisure, and how they felt during these different episodes. This research generated some 
surprising findings, calling into question common folk beliefs about work and leisure. To 
wit, people were found to report more flow-type engagement (e.g., concentration, loss of
a sense of time, and challenge) at work than they did at leisure, which was often spent 
on passive activities such as watching television (Csikszentmihalyi, 1990). With the help 
of ESM, this investigation demonstrated a disconnect between lay theories and actual 
experience. Work is often perceived as a necessary evil, something to “get through” 
before earning carefree leisure time. But, as Csikszentmihalyi argues, leisure time should 
promote happiness only when it is structured and challenging (rather than passive and 
unengaging). This is an example of a pattern of results that may not have been identified 
with scales assessing people’s overall views or retrospective memories of the affective 
experience of work and leisure. 

Since Csikszentmihalyi and colleagues’ (1977) early work, the term experience 
sampling has broadened to include both paper-and-pencil and computerized methods 
of responding. The key property shared by these methods is that participants provide 
reports in their everyday lives—either as soon as possible after being signaled or follow- 
ing a particular event (Conner, Tennen, Fleeson, & Feldman Barrett, 2009). ESM studies 
may last from several days to several months, depending on the goals and resources of
the researchers, and they have been used to study positive states beyond the flow experi- 
ence. Nevertheless, in recent years, the majority of ESM studies have examined people’s 
momentary experiences of positive and negative affect (e.g., Lucas & Diener, 2001). 


Daily Diary and Day Reconstruction Methods 


Despite its many benefits, experience sampling is costly for researchers, requiring a great 
deal of participant time and cooperation, as well as an initial investment in pagers or 
personal digital assistants, which may not always be returned in working order at the study’s
end. When considering the large-scale data collections of SWB indicators that some posi- 
tive psychologists and policymakers have called for (Diener & Seligman, 2004; Krueger 
et al., 2009), traditional ESM methods become prohibitively expensive and even logisti- 
cally impossible. ESM can also be burdensome and intrusive for participants, who may 
be unwilling or unable to respond to a signal or page when it occurs. Finally, due to the 
random nature of the sampling process, significant and meaningful, but rare, daily events 
may be missed. 

To address these challenges while simultaneously preserving the relatively undis- 
torted, online accounts provided by ESM techniques, Kahneman, Krueger, Schkade, 
Schwarz, and Stone (2004) proposed a type of short-term daily diary called the day 
reconstruction method (DRM) as an alternative to ESM. Using a diary format, partici- 
pants using DRM are essentially asked to generate a detailed account of an entire day, 
broken down into distinct episodes. Their typical instructions are as follows: “Think 
of the episodes of your day. An episode can begin or end when you move to a different 
location, change activities, or change the people you are with” (Kahneman et al., 2004, 
p. 1777). Each episode is furnished with a concise label (e.g., “trip to grocery store,” 
“lunch with a friend”), as well as a brief description of where the participant was during 
the episode, what he or she was doing, and with whom. Then, the episode is rated using 
a variety of adjectives (e.g., happy, competent, interested, tense, tired) on scales ranging 
from 0 (Not at all) to 6 (Very strongly). Findings from this paradigm include a study of 
909 working women, revealing that time spent in intimate relations, socializing, relaxing, 
praying or meditating, and eating were among the most enjoyable, whereas commuting,
working, and child care were among the least enjoyable (Kahneman et al., 2004; Krueger 
et al., 2009). 
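To make the episode-level logic concrete, the sketch below averages a single adjective rating (0 to 6) across episodes that share an activity label and ranks the activities from most to least enjoyable. The episode data are invented for illustration, and the function name `mean_rating_by_activity` is hypothetical; this is not the analysis reported in the studies cited above.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical episodes: (activity label, "happy" rating from 0 to 6).
episodes = [
    ("commuting", 2),
    ("socializing", 5),
    ("working", 3),
    ("commuting", 1),
    ("socializing", 6),
    ("working", 2),
]


def mean_rating_by_activity(episodes):
    """Return (activity, mean rating) pairs sorted from highest to lowest."""
    ratings = defaultdict(list)
    for activity, rating in episodes:
        ratings[activity].append(rating)
    averages = {activity: mean(values) for activity, values in ratings.items()}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


for activity, average in mean_rating_by_activity(episodes):
    print(f"{activity}: {average:.2f}")
```

A fuller analysis would presumably also take episode duration and the nesting of episodes within participants into account; the simple unweighted mean here ignores both.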

The fact that child care was rated so low may seem counterintuitive, as it is incon- 
sistent with widely held beliefs that raising children is personally meaningful and gratify- 
ing (Kenrick, Griskevicius, Neuberg, & Schaller, 2010; Lyubomirsky & Boehm, 2010). 
Indeed, because DRM has been used to track affective experience throughout the day, it 
may be ignoring important—but fleeting or infrequent—experiences, such as a sense of 
meaning, perceived connection to others, and engagement. In a large-scale online survey, 
White and Dolan (2009) examined the positive and negative feelings associated with 
various episodes in a day, while also broadening the series of questions to examine the 
thoughts that accompany each episode. Participants reported the extent to which each 
daily episode was personally meaningful, worthwhile, useful to other people, satisfying, 
and helpful in achieving important goals (0 = Not at all, 6 = Very strongly). This exten- 
sion allowed the researchers to explain Kahneman and colleagues’ (2004) paradoxical 
findings, concluding that work and time spent with children were actually highly person- 
ally rewarding, whereas passive leisure activities such as television and general relaxation 
were relatively less rewarding. Moreover, from a methodological standpoint, White and 
Dolan (2009) demonstrated that DRM can be used to examine momentary thoughts, as 
well as momentary affect. 

Although ESM is still considered the “gold standard” for the study of happiness in 
everyday life, despite its cost and generally intrusive nature, DRM has proven a viable 
alternative with impressive psychometric properties. For example, DRM-reported nega- 
tive affect has been shown to correlate positively with resting heart rate (Daly, Delaney, 
Doran, Harmon, & MacLachlan, 2010). Another study found that quality of flow expe- 
riences (Csikszentmihalyi, 1990) was highly positively correlated with DRM-reported 
positive affect and negatively correlated with DRM-reported negative affect (Collins, 
Sarkisian, & Winner, 2009). DRM has been used successfully in samples of college stu- 
dents (Srivastava, Angelo, & Vallereux, 2008), middle-aged working women (Kahneman 
et al., 2004; White & Dolan, 2009), and retirees (Oishi, Whitchurch, Miao, Kurtz, & 
Park, 2009). 

Although DRM is arguably less expensive and intrusive than ESM, it still requires 
a large time commitment from participants, with the daily diary taking approximately 
45-75 minutes per day (Kahneman et al., 2004). Other positive psychologists have used 
similar but less time-consuming daily reporting methods (see Gunthert & Wenze, Chap- 
ter 8, this volume). For example, some studies require participants simply to report the 
emotions they experienced that day, or sometimes over several days, generally using 
Likert scales. In addition to overall positive and negative affect, experiences of excite- 
ment, interest, guilt, gratitude, pride, and anxiety, among others, have been tracked (e.g., 
Algoe, Haidt, & Gable, 2008; Cohn, Fredrickson, Brown, Mikels, & Conway, 2009; 
Oishi, Schimmack, & Colcombe, 2003). 


What Novel Information Is Gained 
from Real-Time Measures? 


Studying SWB in everyday life, whether through experience sampling or a daily diary 
methodology, is arguably more costly and inconvenient than administering a brief,
one-shot SWB measure. Hence, an important question concerns what precisely can be learned
from these “online” measures. Do momentary or daily assessments provide information 
that broader, more global reports of SWB do not? 

As mentioned earlier, these methodologies confer novel information about how peo- 
ple use their time, how they feel, and what they are thinking during different kinds of 
activities (Csikszentmihalyi et al., 1977; Krueger et al., 2009; White & Dolan, 2009). 
Momentary and daily diary measures have also been able to establish the distinctive- 
ness of the affective and cognitive components of happiness. For example, Cohn and 
colleagues (2009) provided compelling evidence for the unique predictive value of the 
affective component (i.e., positive affect), relative to the cognitive one (i.e., life satisfac- 
tion). For a month, participants supplied daily emotion reports using a Web-based daily 
diary methodology. Specifically, participants were instructed to reflect on their day, and 
then report their strongest experience of each of 18 discrete emotions (e.g., joy, pride, 
gratitude, awe, anger, fear, embarrassment, disgust) in that day, using a 5-point scale (0 = 
Not at all, 4 = Extremely). Life satisfaction (assessed with the Satisfaction with Life Scale; 
Diener et al., 1985) and ego resilience (the ability to be flexible in response to challenging 
or changing circumstances; Block & Kremen, 1996) were measured at the beginning and 
end of the study. Mediation modeling revealed the unique contribution of positive affect 
in predicting ego resilience, as reported in the computerized daily diaries, when control- 
ling for general life satisfaction. This study offers persuasive evidence that daily reports 
of affect are distinct from one-time global evaluations of life satisfaction. 

Online measures of positive affect and enjoyment have also demonstrated interesting 
disconnects from their more global retrospective counterparts in the realm of judgment 
and decision-making. For example, Mitchell, Thompson, Peterson, and Cronk (1997) 
used ESM methodologies to study “rosy prospection” (or anticipation of future experi- 
ences) and “rosy recollection” (or memories of past experiences). Using a variety of famil- 
iar experiences, such as a vacation abroad, Thanksgiving break, and a bicycle trip, they 
found that people’s predictions and memories of those experiences were more positive 
than were their actual online (or momentary) experiences. 

A similar study examined the relationship between online and retrospective reports 
of a college spring break trip (Wirtz, Kruger, Napa Scollon, & Diener, 2003). Partici- 
pants reported their anticipated levels of affect 2 weeks prior to their trip, and were given 
personal digital assistants (PDAs) to take with them on the trip. The PDAs signaled them 
several times a day, at which point they reported their affect and enjoyment. They also 
completed retrospective reports of their spring break trip several days after being back on 
campus, and again 4 weeks after spring break. At this final measurement point, they also 
reported the extent to which they wished to take a similar trip in the future. The research- 
ers’ findings suggested a discrepancy between the anticipated, online, and recalled experi- 
ences of positive affect and enjoyment. Specifically, a strong correlation emerged between 
anticipated and recalled affect, but associations with the online reports were substan- 
tially weaker. As in the Mitchell and colleagues (1997) studies, online reports generally 
indicated greater negativity than had been anticipated or recalled. This is not surprising given
that vacations are often filled with neutral or even negative moments (e.g., waiting in 
line, feeling tired and irritable) that may not be taken into account when anticipating and 
investing in a future vacation. When looking back, perhaps with the assistance of mean- 
ingful souvenirs and a desire to reduce dissonance (Mitchell et al., 1997), the neutral and 
stressful moments of a vacation can easily fade from memory, leaving a biased recollec- 
tion of an enjoyable vacation. Interestingly, Wirtz and colleagues (2003) found that it was 
actually the retrospective accounts of positive affect and enjoyment that predicted the
desire to go on a parallel adventure in the future. 

The phenomenon of “duration neglect” is another, related source of the divergence 
between online and retrospective accounts. According to the peak-end rule, recollec- 
tions of an experience are most powerfully influenced by its emotional high point (peak) 
and its ending (Fredrickson & Kahneman, 1993). When people recall and evaluate past 
experiences, they are inclined to neglect the duration of those experiences. As a result, 
retrospective ratings of happiness are likely to be fundamentally flawed. The discrepancy 
between online and recalled affect is more than a topic for academic debate. A practical 
application of this research is that greater insight into momentary affective experience 
could promote more optimal, happiness-boosting decision making. 
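The contrast between an online average and a peak-end summary can be illustrated with a small numerical sketch. The ratings below are invented, and treating the peak-end summary as the simple mean of the highest rating and the final rating is an assumed simplification for illustration rather than a formula given in the text.

```python
from statistics import mean


def online_average(ratings):
    """Mean of all momentary ratings, so every sampled moment counts equally."""
    if not ratings:
        raise ValueError("need at least one rating")
    return mean(ratings)


def peak_end_summary(ratings):
    """Assumed simplification: average of the highest rating and the last one."""
    if not ratings:
        raise ValueError("need at least one rating")
    return (max(ratings) + ratings[-1]) / 2


# Hypothetical momentary enjoyment ratings for a shorter and a longer trip.
short_trip = [2, 3, 9, 3, 2, 2, 2, 8]
long_trip = [2, 3, 9, 3, 2, 2, 2, 2, 2, 2, 8]  # three extra dull moments

for label, trip in (("short", short_trip), ("long", long_trip)):
    print(label, round(online_average(trip), 2), peak_end_summary(trip))
```

In this example the extra dull moments lower the online average of the longer trip but leave the peak-end summary unchanged, which is the pattern that duration neglect describes.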

The disconnect between online and global or retrospective accounts has fostered a 
lively debate within positive psychology about not only how best to measure happiness 
but also the very nature of happiness itself. Is happiness signified by an individual’s global 
evaluation of his or her life, or is it the aggregate of many moments, as measured by ESM? 
Consider some of the most counterintuitive findings about happiness, such as the classic 
study of lottery winners and paraplegics (Brickman, Coates, & Janoff-Bulman, 1978) or 
a study comparing the happiness of Southern Californians and Midwesterners (Schkade 
& Kahneman, 1998). Taken together, such studies, which use broad, global measures of 
happiness, provide evidence for the existence of a hedonic treadmill—namely, that people 
typically adapt to their life circumstances (e.g., winning money, becoming confined to a 
wheelchair, or moving to Southern California), such that any momentary increases or 
decreases in their happiness after such events are unsustainable as they gravitate back to a 
hedonic set point (Brickman et al., 1978; Diener, Lucas, & Scollon, 2006; Lyubomirsky, 
Sheldon, & Schkade, 2005). 

An intriguing question is whether such studies would still show evidence of
adaptation if they included an experience sampling component. Perhaps not. A
wheelchair-bound participant in an ESM study may frequently be paged during moments of
discomfort or a sense of futility. A Southern Californian may be paged while sitting in a traffic
jam; a Midwesterner may be paged on a warm, sunny day. Although these ordinary life 
experiences might not carry enough weight to affect global ratings of SWB, they arguably 
produce a strong affective experience in the moments during which they occur. 

Although this issue is interesting from both conceptual and methodological
standpoints, online and global reports of well-being frequently complement each other. For
example, as described earlier, many studies have established a robust association between 
the quality of a person’s social relationships and his or her global SWB (Diener &
Seligman, 2002). Consistent with these findings, a study that used ESM to examine the
link between social interactions and moods throughout the course of a day found that 
momentary mood was significantly more positive when participants reported being in the 
presence of others compared to being alone (Lucas & Diener, 2001). 

Another example of congruence between real-time and global reports comes from 
a recent longitudinal study of happiness over the lifespan (Carstensen et al., in press). 
These researchers used ESM to predict longevity and other important outcomes over a 
10-year period, with participants reporting their emotional experiences five times per day 
over the course of a week. Furthermore, this ESM procedure was repeated 5 years and 
10 years later. Frequency of positive emotions (relative to negative emotions) experienced 
throughout the day was significantly related to longevity. Notably, however, they also 
found that Lyubomirsky and Lepper’s (1999) Subjective Happiness Scale was highly
correlated both with momentary positive affect and with longevity.

Although a case can be made for always using online measures in well-being research 
(Krueger et al., 2009), such measures should be of higher priority in situations when 
researchers have reason to believe that they will provide information above and beyond 
that of global well-being measures. After all, observation of disconnects between online 
and global SWB ratings may be especially likely in certain types of situations. First, 
experiences that are self-contained and physically arousing—like the bicycle trip studied 
by Mitchell and colleagues (1997)—may be difficult to reconstruct and evaluate accu- 
rately. Due to duration neglect, for example, when making retrospective ratings of the 
experience, cyclists may fail to consider appropriately the length of time they felt tired 
or uncomfortable on the trip (Fredrickson & Kahneman, 1993). Furthermore, despite 
the moments of physical pain and exhaustion the cyclists may have experienced in real 
time, later on they are likely to have difficulty mentally recreating these physical sensa- 
tions (Loewenstein, 1996). In hindsight, they may know that the trip was difficult at 
some level, but they will be unable to recall fully just how physically uncomfortable they 
felt. A related, more overtly motivated reason for a disconnect applies to an experience 
that has a personally relevant outcome. In an attempt to maintain self-esteem, as well as 
to reduce dissonance, people may recall an experience as more happy or more positive 
than it really was. For example, after the bicycle trip, the cyclists may have preferred to 
recall moments when they felt strong and fit rather than weary and defeated. They may 
also have wished to persuade themselves that the decision to make the trip was the right 
one. Finally, temporal construal theory (Trope & Liberman, 2003) predicts that, as time 
passes, a self-contained experience such as a bicycle trip will increasingly be recalled in 
terms of abstract features (e.g., personal growth, life experience) rather than mundane, 
concrete features (e.g., rain, sore legs) that characterized the trip in the moment. For these 
reasons, a clear divergence between online and retrospective measures of happiness can 
be expected, in keeping with the idea of the rosy view (Mitchell et al., 1997). 

Finally, researchers should consider the extent to which people’s global reports of 
what makes them happy might be biased by preconceptions and cultural values. For 
example, although people tend to rate leisure time as desirable and pleasant, ESM stud- 
ies reveal that many do not enjoy it as it occurs (Csikszentmihalyi, 1990). Similarly, as 
described earlier, vacations are eagerly anticipated and recalled fondly but do not seem to 
be nearly as pleasant in the moment (Wirtz et al., 2003). This problem can be partially 
addressed by ordering survey items so that happiness is reported first; hence, beliefs that 
may bias responses are made relatively less accessible (Schwarz & Strack, 1999). The 
deeper definitional issue, however, still remains. A multimethod approach that uses both 
online and global or retrospective measures is ideal. 


Other Positive Psychology Constructs 


Positive psychology researchers aim to understand a variety of positive states. To this end, 
online measures may serve as a valuable tool in investigations of other positive psycho- 
logical constructs. A notable example comes from recent work on inspiration. The first 
brief, global trait measure of inspiration, created by Thrash and Elliot (2003), includes 
items such as “I experience inspiration” and “I am inspired to do something” (p. 889). 
This scale has been shown to correlate in the predicted direction with a number of positive
constructs, such as intrinsic motivation, openness to experience, positive affectivity, 
and creativity. The researchers also examined the extent to which people are inspired in 
everyday life, using a daily diary method. Over 2 weeks, participants received a daily 
e-mail containing a prompt and a questionnaire. They reported the extent to which they 
felt inspired throughout the day, as well as a number of correlates of inspiration, such 
as creativity, positivity, competence, openness, and freedom. When frequencies of these 
experiences were aggregated over the course of the 2-week study, the findings revealed 
that these constructs often co-occur. Moreover, the diary method allowed for testing 
directional relationships between constructs. For example, inspiration was shown to pre- 
cede feelings of creativity, but not vice versa. 

Thrash and his colleagues also found that reports of inspiration in the morning are 
predictive of well-being later in the day (Thrash, Elliot, Maruskin, & Cassidy, 2010), 
and that feelings of inspiration mediate the relationship between having a creative idea 
and a creative end product, but that other positive states, such as awe, effort, and posi- 
tive affect, do not (Thrash, Maruskin, Cassidy, Fryer, & Ryan, 2010). Going beyond 
correlations to establish the temporal precedence of inspiration would be difficult, if not 
impossible, with traditional trait-like measures. By employing a daily diary methodol- 
ogy, the relationship between inspiration and related constructs becomes much more 
interpretable. 

An important distinction between happiness and other positive psychology con- 
structs is worth noting. People can fairly easily report on their affective state much of the 
time. Indeed, what makes ESM possible is that people are seldom feeling nothing, affec- 
tively speaking (Diener, Sandvik, & Pavot, 1991). By contrast, because other types of 
positive experiences, such as inspiration, do not occur frequently in everyday life, obtain- 
ing a random sample of moments throughout the day is likely to miss such experiences. 
Recounting one’s day in a diary format or using an event-contingent sampling method 
(see Moskowitz & Sadikaj, Chapter 9, this volume) appears to be most appropriate for 
relatively more rare types of positive experiences. 


Challenges Involved in Real-Time Measurement 


In the words of economist John Stuart Mill (1873/1989), “Ask yourself whether you are 
happy and you cease to be so” (p. 94). Consistent with this notion, participating in an 
ESM or daily diary study may encourage respondents to reflect on their own happiness 
more than they would otherwise. Excessive focus on and monitoring of happiness levels 
(i.e., “Am I happy yet? Am I happy yet?”) is thought to be counterproductive, prompting 
a reduction in positive affect (Conner & Reid, 2011; Schooler, Ariely, & Loewenstein, 
2003). For this reason, researchers may want to keep the signals per day in ESM studies at 
a reasonable number, with the goal of obtaining adequate data to address their research 
questions without engendering reactance in their participants (see Barta, Tennen, & Litt, 
Chapter 6, this volume; Miron & Brehm, 2006). 

Another challenge to studies of everyday life is that participants are expected to 
provide reports as soon after being signaled as possible. This charge, however, is some- 
times impossible or highly unlikely, which results in important experiences being missed. 
Diary studies, by contrast, face a different challenge because they require that reports be 
made at the end of each day. Consequently, study participants may be tempted to turn 
in backdated entries, rendering their reports prone to retrospective biases. Fortunately, 
new technologies are mitigating this problem because computerized diary methodologies 
and online submission methods provide a time stamp for the completion of each diary. 
This procedure both increases compliance and discourages backdating (Stone, Shiffman, 
Schwartz, Broderick, & Hufford, 2003). 

Finally, despite the rich information to be gained by online methodologies, such 
methodologies are arguably underutilized in psychological research because they require 
an initial investment of time and money, and the data may be difficult to analyze (Con- 
ner et al., 2009). However, such limitations may diminish as new technologies evolve and 
expertise becomes more widespread. 


Looking Ahead 


Recent technological advances, from specialized websites to global positioning systems 
(GPS), have taken the study of happiness in everyday life to a new level. Authentichappiness.org
is one example of a website that allows individuals to create an account to track 
their happiness over time, providing both customized feedback and a source of data for 
researchers. This site contains a wide variety of validated measures commonly used in 
positive psychological research and has attracted a large number of participants (300 per 
day) who complete measures without financial compensation. Data from this site have 
been used to demonstrate the efficacy of happiness-promoting techniques in an adult 
sample of over 400 (Seligman, Steen, Park, & Peterson, 2005). 

The ubiquity of mobile phones with text messaging and Internet capabilities also 
creates exciting new possibilities for the study of happiness in everyday life. Recent 
research on the quality of family interactions, for example, used text messaging to signal 
participants to provide ESM reports, eliminating the need for pagers or PDAs (Rönkä, 
Malinen, Kinnunen, Tolvanen, & Lämsä, 2010). Smartphones such as the BlackBerry 
and the iPhone can serve as platforms for applications created for the specific purpose 
of monitoring and increasing happiness. Because many individuals hold happiness as a 
highly desired goal, they do not have to be compensated for submitting their data. In fact, 
people are willing to pay to access some of these applications; hence, “self-help” applica- 
tions with names like the Habit Factor, Gratitude Journal, and iStress have proliferated. 

Although this new technology is in the early stages, researchers have begun using it 
to obtain online data from a large number of participants. LiveHappy™ (Lyubomirsky, 
Della Porta, Pierce, & Zilca, 2010), an inexpensive iPhone application, is geared toward 
increasing participants’ happiness by encouraging the performance of empirically vali- 
dated activities such as expressing gratitude, focusing on meaningful goals, savoring the 
moment, performing acts of kindness, nurturing interpersonal relationships, and focus- 
ing on “best possible selves.” The iPhone itself is used as a tool to facilitate engagement 
in these activities—for example, users might express their gratitude by emailing, texting, 
or calling someone on their “contact list.” Unlike traditional ESM, this application does 
not signal participants to report their affect in the moment. Rather, participants choose 
to provide information about their current mood and their overall global happiness, as 
measured by the Subjective Happiness Scale (Lyubomirsky & Lepper, 1999). They can 
also determine the extent to which a particular happiness-promoting activity “fits” with 
their preferences and goals (Lyubomirsky, Sheldon, et al., 2005). Results are then stored 
to track within-person change over time. Preliminary data are promising, showing significant
increases in positive mood after participation in the activities available on the 
application, especially those involving the expression of gratitude, nurturing relation- 
ships, and visualizing one’s best possible self (Lyubomirsky et al., 2010). Additionally, by 
providing information on the sorts of activities people naturally enjoy doing and opt to 
do, this methodology is a useful complement to laboratory studies in which participants 
are randomly assigned to take part in a specific happiness-promoting activity (see Sin & 
Lyubomirsky, 2009, for a review). 

By contrast, Killingsworth and Gilbert’s (2010) free application, Track Your Happi- 
ness, is more similar to ESM. On registering for the service on the application’s website, 
participants complete a brief measure of global happiness and provide demographic infor- 
mation. They also indicate their preferences for the ESM portion of the service, including 
how frequently they want to be signaled (three times per day is the default), and in what 
12-hour period of time they prefer to receive the daily signals. Signals can take the form 
of a text message or e-mail, with each signal providing a link to a website that contains a 
questionnaire. Although the questionnaires vary slightly, they assess factors such as how 
participants feel in the moment, what kinds of activities they are engaging in, feelings of 
productivity, the extent to which participants are focused on the task at hand, whether 
they are alone or with others, and the quality of their sleep. After providing a minimum 
number of responses, participants can access a summary of their data on a correspond- 
ing website. Although this procedure may seem intrusive to some, many participants are 
likely to be motivated to access their own Personal Happiness Profile, available after a 
certain number of responses, to gain greater insight into how they spend their time and 
how they feel throughout the day. At the time of this writing, Killingsworth and Gilbert 
have received an estimated 190,000 responses from a diverse sample of over 5,000 people 
(Killingsworth, personal communication, June 9, 2010), suggesting that this approach is 
sufficiently motivating to participants, even without monetary compensation. 
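
The signaling preferences described above (a chosen number of signals per day, delivered within a chosen 12-hour window) can be illustrated with a short sketch. The text does not say how Track Your Happiness actually selects signal times; the function name `draw_signal_times`, its parameters, and the choice to draw one random time from each equal block of the window are all assumptions made for illustration only.

```python
import random

def draw_signal_times(window_start_hour, n_signals=3, window_hours=12, seed=None):
    """Return n_signals random clock times ('HH:MM') inside the signaling window.

    Hypothetical scheme: the window is split into n_signals equal blocks and
    one time is drawn uniformly from each block, so that signals are
    unpredictable but still spread across the participant's chosen window.
    """
    rng = random.Random(seed)
    block_minutes = window_hours * 60 / n_signals
    times = []
    for block in range(n_signals):
        # Minutes elapsed since the start of the window for this signal.
        offset = (block + rng.random()) * block_minutes
        total = int(window_start_hour * 60 + offset)
        times.append(f"{(total // 60) % 24:02d}:{total % 60:02d}")
    return times

# Example: default of three signals in a window that opens at 9:00 a.m.
schedule = draw_signal_times(window_start_hour=9, n_signals=3, seed=1)
print(schedule)
```

Drawing one time per block is only one possible design choice; fully independent random times would also be unpredictable but could cluster several signals close together.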

A possible limitation of studies using such mobile or Web applications is that the 
participants are self-selected and may not be representative of the general public. After 
all, application users possess relatively expensive smartphones, are technologically savvy, 
and are motivated to gain insight into and increase their levels of well-being. Although 
the demographics of smartphone users have not yet been established, not surprisingly, 
higher levels of education and income have been found to characterize Web-based sam- 
ples (Gosling, Vazire, Srivastava, & John, 2004). However, as some researchers have con- 
vincingly argued, Web participants are usually sufficiently diverse and take the research 
quite seriously, even though they are doing the studies on their own time, unsupervised 
by an experimenter (Gosling et al., 2004). Although it is too soon to tell, such Web and 
smartphone applications, available for free or for a small fee, may shape the future of 
happiness research. 

Given the central role that interpersonal relationships play in happiness, another 
recent methodological advance that allows researchers to study the nature of everyday 
social interactions is worth noting. The Electronically Activated Recorder (EAR), an 
unobtrusive and reliable methodology that does not rely on self-report, can be used to 
examine the social interactions that characterize everyday life (see Mehl & Robbins, 
Chapter 10, this volume). For example, a recent study of conversational styles required 
participants to wear the EAR, which recorded 30 seconds of sound every 12.5 minutes, 
over the course of 4 days (Mehl, Vazire, Holleran, & Clark, 2010). Results indicated that 
happier people, as measured by both a global and a single-item measure, were more likely 
to spend time discussing substantive topics than to make small talk. This study suggests 
that it is the quality, rather than the sheer frequency, of social interactions that matters 
most, thereby shedding new light on the robust relationship between interpersonal rela- 
tionships and happiness (Diener & Seligman, 2002; Krueger et al., 2001). 
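
To make the EAR sampling rate concrete, the following sketch works out how much sound a schedule of 30 seconds every 12.5 minutes captures. It assumes that one clip begins every 12.5 minutes and, purely for illustration, that the device runs for 16 waking hours per day; the waking-hours figure and the function name `ear_coverage` are assumptions, not details from the study.

```python
def ear_coverage(record_seconds=30, cycle_minutes=12.5, waking_hours=16):
    """Summarize how much audio an intermittent recording schedule captures.

    waking_hours is an illustrative assumption, not a figure from the study.
    """
    fraction_recorded = record_seconds / (cycle_minutes * 60)
    clips_per_hour = 60 / cycle_minutes
    clips_per_day = clips_per_hour * waking_hours
    minutes_per_day = clips_per_day * record_seconds / 60
    return fraction_recorded, clips_per_hour, clips_per_day, minutes_per_day

fraction, per_hour, per_day, minutes = ear_coverage()
print(f"{fraction:.0%} of time recorded, {per_hour:.1f} clips per hour")
print(f"about {per_day:.0f} clips and {minutes:.0f} minutes of audio per 16-hour day")
```

Under these assumptions, the schedule records roughly 4% of the monitored time, which helps explain why the method can remain unobtrusive while still sampling many moments across a day.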

Other recent developments in the unobtrusive study of everyday life include wireless 
sensing devices worn on the body to detect physical activity, room temperature, amount 
of light exposure, heart rate, and even positional data (as determined by a GPS locating 
device; e.g., Tapia, Intille, Lopez, & Larson, 2006; see Goodwin [Chapter 14] and Intille 
[Chapter 15], this volume). Combining these technologies with ESM data can provide 
potentially rich new insights into some of the more subtle or as yet unidentified predictors 
of everyday happiness. 

Finally, the geographic information system (GIS), a powerful computerized mapping 
tool, has recently been used in conjunction with a phone survey to identify a positive
correlation between a geographic location’s population density and the self-reported 
SWB ratings of a location’s inhabitants (Davern & Chen, 2010). The authors conclude 
that the GIS has the potential to identify links between well-being and numerous other 
features that characterize a geographic location, such as proximity to services (e.g., public 
transportation, health care), crime rate, climate, and demographic makeup. Although 
it has received little attention from psychological scientists thus far, GIS seems particu- 
larly compatible with the recent call for research on broad, national indicators of SWB 
(Diener, Kesebir, & Lucas, 2008; Krueger et al., 2009). 


Conclusion 


In summary, global SWB measures have been found to be reliable and valid, and have 
provided positive psychology researchers a wealth of information about the causes, corre- 
lates, consequences, and stability of happiness. However, as described earlier, mounting 
evidence suggests that real-time measures, in the form of experience sampling or daily dia- 
ries, contribute unique and novel information about what people do and how they feel in 
their everyday lives. As work on happiness becomes integrated with national indicators of 
the quality of life (Diener et al., 2008), as positive psychological science becomes increas- 
ingly popularized, and—perhaps most important—as technology becomes increasingly 
accessible, these types of measures will arguably become much more common. 

In recent years, researchers’ understanding of happiness and other positive constructs 
has grown rapidly. As the field moves forward and as technology advances, positive psy- 
chologists should continue to complement rigorous laboratory research with a greater 
focus on what people do, think, and feel in their daily lives. 


CHAPTER 32 


Health Psychology 


JOSHUA M. SMYTH 
KRISTIN E. HERON 


There is a long history of the study and practice of mind-body medicine (see Harrington,
2009). Speaking broadly, this area of enquiry attempts to study the interplay 
of social, behavioral, psychological, and biological factors that influence health (Smyth 
& Stone, 2003). Such research is conducted at many “levels” and includes basic biologi- 
cal and physiological processes, as well as behavioral, social, psychological, cultural, and 
other factors. We use the term health psychology, although there are many other cognate 
disciplines and related nomenclatures (behavioral medicine, medical sociology, psychoso- 
matic medicine, etc.). The field encompasses a variety of topics, including stress and mod- 
erators of stress, health-enhancing (e.g., exercise, disease screenings, sunscreen use) and 
-compromising behaviors (e.g., smoking, sedentary lifestyle), the management of chronic 
illness, behavioral and psychological factors associated with specific diseases and condi- 
tions (e.g., heart disease, diabetes, AIDS, pain, cancer), the structure and use of health 
services, patient-provider relationships, and many other topics. Given this extraordinary 
range and variety of topics, health psychology is by necessity a multidisciplinary and 
integrative field. As such, research methods for studying daily life can be particularly 
useful for health psychology researchers because they allow for consideration of complex 
relationships among various outcomes, often as they unfold over time and in natural 
contexts (Smyth & Stone, 2003). 

Our goal in this chapter is to provide an overview of how research methods for 
studying daily life can be applied within the field of health psychology. We begin with a 
discussion of the specific advantages and challenges researchers might experience when 
using these methods in the field. We then provide an overview of the various ways in 
which daily assessment methods, particularly ecological momentary assessment (EMA), 
can be and have been used to address research questions in health psychology. Finally, 
we conclude with a discussion of the innovative uses of daily assessment and intervention 
methods, particularly as advances in technology continue, as well as some future direc- 
tions of the field. 


Advantages of Using Daily Methods in Health Psychology 


As has been discussed at length in previous chapters of this handbook (see Part I), there 
are several notable advantages of using daily and momentary assessment methods, 
namely, improved ecological validity, reduced retrospective recall biases, and the ability 
to examine more complex, dynamic relationships among variables. Many of the issues 
and potential concerns with more traditional assessment techniques that are raised in this 
handbook have implications for research in health psychology as well. 

Issues broadly related to ecological validity are particularly important in health psy- 
chology. Many important aspects of behavior and experience related to health and illness 
cannot be easily captured or emulated in a laboratory setting. For example, it is difficult 
to measure reliably or to capture low-frequency or dynamic events (be they symptoms, 
social interactions, thoughts, etc.) in a laboratory setting. Although it is possible to ask 
people to recall such events, this approach may be suboptimal (a topic covered in earlier 
chapters, and one we return to shortly). Processes that may be of particular interest can 
also be ethically or pragmatically problematic to create or induce in a controlled setting 
(e.g., marital discord, severe pain, depressive symptoms); in such cases, it may be possible 
to study their natural occurrence in daily life. 

Even when it is possible to study processes of interest in a controlled setting, there 
is often an implicit assumption that data from assessments in these “artificial” settings 
(in the sense of not reflecting a person’s normal context; e.g., hospitals, doctor’s office, 
research laboratory) are accurate reflections of the processes or relationships that occur 
in everyday life. In fact, this is not always the case. For example, in a well-documented 
phenomenon known as white-coat hypertension, some people have high blood pressure 
when measured in medical settings, but their ambulatory blood pressure measurements 
are normal (Pickering, Davidson, Gerin, & Schwartz, 2002). Conversely, in some people, 
blood pressure assessed in medical settings is normal, while ambulatory measurements are 
high (so-called masked hypertension; Pickering et al., 2002). Neither of these cases is 
explainable by biomedical or disease attributes, and the discrepancy between medical and 
real-world settings is attributed to the artificial nature of the medical setting assessment. 
This is, of course, not merely an academic or measurement issue—such discrepancies 
can lead to the mismanagement of the underlying issue; in this case, a subset of patients 
may be unnecessarily treated/medicated (white-coat hypertensives), whereas another set 
of patients may require treatment/medication but not receive it (masked hypertensives). 

Another concern noted previously (see Schwarz, Chapter 2, this volume) is that the 
pervasive use of heuristics in memory search and reconstructions can systematically bias 
participant responses obtained in global retrospective reporting. Although this method 
in its many forms (interviews, self-report measures, etc.) has several advantages, it may 
be suboptimal for some questions in health psychology. In particular, the recall of certain 
aspects of behavior and experience is problematic: Dynamic and complex states, such 
as affect and pain; attempts to recall complex temporal structuring (including inferences 
about antecedents and consequents); and summarizing events or behaviors are areas 
where global recall can be inaccurate and/or biased. For example, in a study assessing 
cigarette consumption among smokers, the number of cigarettes smoked was assessed 
using electronic diaries and EMA for 1 week, as well as a timeline followback method 
that asked participants to recall the number of cigarettes smoked during the previous 
week (Shiffman, 2009). Results showed that these two methods of assessment were only modestly
correlated (r = .29). Moreover, and perhaps not surprisingly, when asked to recall 
the number of cigarettes smoked, people tended to round the number of cigarettes they 
reported smoking (e.g., to the nearest 10). These data illustrate the potential biasing of 
assessments that require recall of past behaviors and suggest that if precise estimates of 
a health behavior are needed, daily assessment methods are likely superior to methods 
requiring recall over extended periods of time. 
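
The rounding tendency described above (sometimes called digit heaping) can be checked with a very simple calculation: the share of reported counts that fall exactly on a multiple of 10. The sketch below is illustrative only; the function name `heaping_share` and both sets of counts are hypothetical and are not data from the study cited.

```python
def heaping_share(counts, base=10):
    """Return the proportion of reported counts that are exact multiples of base.

    With no rounding, roughly 1 in `base` reports would be expected to land
    on a multiple of `base`; a much larger share suggests rounded recall.
    """
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(1 for c in counts if c % base == 0) / len(counts)

# Hypothetical cigarette counts for the same eight smokers (not real data).
recalled = [20, 10, 20, 30, 15, 20, 40, 10]   # retrospective recall
diary = [18, 11, 23, 27, 14, 19, 36, 9]       # electronic diary totals

print(heaping_share(recalled))  # 0.875
print(heaping_share(diary))     # 0.0
```

A marked gap between the two shares, as in this invented example, is the kind of pattern that would point to rounding in the recalled reports.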

In the absence of objective measures of illness, many areas of health psychology 
research must rely on self-reported symptoms (e.g., pain, fatigue, body dissatisfaction). 
There is, however, evidence to suggest that subjective global assessments of physical symp- 
toms (and other experiential states, e.g., mood) may not accurately reflect the everyday 
experience of these symptoms. In studies of patients with chronic pain (Stone, Schwartz, 
Broderick, & Shiffman, 2005) and chronic fatigue syndrome (Sohl & Friedberg, 2008), 
for example, data suggested that patients who reported greater variability in their daily 
symptoms (i.e., pain or fatigue, respectively), were less accurate in recall of their average 
symptoms. These findings suggest that health psychology researchers must carefully con- 
sider whether and how retrospective recall biases may influence assessment outcome and 
identify whether daily (or more fine-grained) assessment methods may provide a more 
accurate measure of the variable(s) of interest. 

Another advantage of daily assessment methods in research is that temporal relation- 
ships among variables can be explored, because multiple assessments are collected over 
time. Such temporally rich data allow researchers to address more nuanced questions 
about dynamic associations between health-related variables and processes that occur 
over time. Temporal questions may be addressed with different degrees of granularity, 
from moments to days to longer developmental intervals. By collecting multiple assess- 
ments throughout the day, researchers can address questions regarding the psycholog- 
ical, social, emotional, physical, and contextual precursors and sequelae to a specific 
event (e.g., drug craving, binge-eating episode), in real time and in natural settings. For 
example, in a study of patients with bulimia nervosa, the effect of time of day 
and day of the week on eating disorder behaviors (bingeing, vomiting) was evaluated in 
the natural environment using an EMA protocol (Smyth et al., 2009). Findings suggested 
that binge-eating and vomiting events increased throughout the day, peaking in the early 
afternoon and late evening; the probability of bingeing and vomiting was also higher on 
weekends compared to weekdays. This study provides a simple example of the types of 
temporal research questions that can be addressed using an intensive sampling schedule, 
in this case one including six randomly prompted assessments throughout the day (signal- 
contingent), an end-of-day report (interval-contingent), and assessments following binge- 
ing or vomiting events (event-contingent). 
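
As a rough illustration of how time-stamped, event-contingent reports support such temporal questions, the sketch below tallies events by hour of day and compares per-day event rates on weekdays and weekends. The function name `summarize_events` and the timestamps are hypothetical; this is not the analysis used in the study, which would require appropriate multilevel models.

```python
from collections import Counter
from datetime import datetime

def summarize_events(timestamps, n_weekdays, n_weekend_days):
    """Tally events by hour and return per-day rates for weekdays and weekends.

    n_weekdays and n_weekend_days are the numbers of observed days of each
    type, needed because a week contains five weekdays but two weekend days.
    """
    by_hour = Counter(ts.hour for ts in timestamps)
    weekend_events = sum(1 for ts in timestamps if ts.weekday() >= 5)
    weekday_events = len(timestamps) - weekend_events
    return by_hour, weekday_events / n_weekdays, weekend_events / n_weekend_days

# Hypothetical event reports from one participant over one week (not real data).
events = [
    datetime(2010, 6, 5, 14, 10),  # Saturday afternoon
    datetime(2010, 6, 5, 21, 40),  # Saturday evening
    datetime(2010, 6, 6, 20, 5),   # Sunday evening
    datetime(2010, 6, 7, 13, 30),  # Monday afternoon
    datetime(2010, 6, 8, 21, 15),  # Tuesday evening
]

by_hour, weekday_rate, weekend_rate = summarize_events(events, 5, 2)
print(sorted(by_hour.items()))
print(weekday_rate, weekend_rate)  # 0.4 1.5
```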

It is important to note, however, that some research questions do not require mul- 
tiple daily assessments, and a daily diary approach may be adequate. For instance, Grov, 
Golub, Mustanski, and Parsons (2010) used an online daily diary method for 30 days 
to assess affect and sexual risk behavior among men who have sex with men. Results 
showed that increased negative affect (fear, sadness, anger, disgust) was associated with 
reduced sexual risk behavior, and sexual activation was associated with greater sexual 
risk taking. Daily assessment methods can also be coupled with more traditional longi- 
tudinal assessment designs to assess longer developmental intervals. Measurement burst 
studies allow for collection of relatively intensive “bursts” of assessments over short periods
of time (hours, days, weeks) that are then repeated for the same individuals over 
longer time intervals (e.g., months, years; see Sliwinski, 2008). For example, Sliwinski, 
Almeida, Smyth, and Stawski (2009) examined data from two measurement burst diary 
studies that assessed emotional reactivity to daily stress across the lifespan. In these stud- 
ies a burst of daily assessments was collected, then repeated after a 10-year interval 
(Study 1) or every 6 months for 2 years (Study 2). This approach allows the characteriza- 
tion of both within-person and between-person variability and change over time in daily 
stress reactivity. By using these two studies, with their similar methodology but different 
time frames, results cover both relatively long (10-year) and short (2-year) developmental 
periods. Interestingly, although there were relatively stable aspects to individuals’ stress 
reactivity, stress reactivity increased on average across the adult lifespan. Moreover,
this approach revealed that there was also substantial within-person variability in 
stress reactivity (e.g., individuals showed greater daily stress reactivity during times of 
greater ambient/global stress). 
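
The distinction between between-person and within-person variability can be made concrete with person-mean centering, a common first step before multilevel analysis of diary data. The sketch below is a minimal illustration under that assumption; the function name `person_mean_center` and the scores are hypothetical, and the source does not state that the cited studies used this exact procedure.

```python
from statistics import mean

def person_mean_center(daily_scores):
    """Split daily scores into between-person and within-person parts.

    daily_scores maps a participant ID to that person's list of daily values.
    Returns (person_means, deviations): each person's average level, and each
    day's departure from that person's own average.
    """
    person_means = {pid: mean(vals) for pid, vals in daily_scores.items()}
    deviations = {
        pid: [v - person_means[pid] for v in vals]
        for pid, vals in daily_scores.items()
    }
    return person_means, deviations

# Hypothetical daily stress-reactivity scores for two participants.
scores = {"p1": [2, 4, 3], "p2": [5, 7, 9]}
means, devs = person_mean_center(scores)
print(means)  # {'p1': 3, 'p2': 7}
print(devs)   # {'p1': [-1, 1, 0], 'p2': [-2, 0, 2]}
```

Differences among the person means reflect between-person variability, whereas the deviations capture day-to-day fluctuation within each person.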

Such daily and within-day approaches to data capture (be they EMA, daily diary, 
measurement burst, or some alternative) can allow the dynamic, within-person associa- 
tion between variables of interest to be characterized over longer developmental periods. 
It is again worth noting that such variables can include self-report, behaviors, passive or 
active observation, biomarkers or other indicators of disease status, and so forth. With 
some ingenuity, most relevant variables can be integrated into a more intensive data cap- 
ture methodology. This approach represents an important improvement over the use of 
cross-sectional data (where, of course, no temporal sequencing can be demonstrated), 
and—at least in some methodologies, such as EMA—can provide data capture of suf- 
ficient density to “drill down” to the level of temporal analysis desired (be that days, 
hours, or minutes). Thus, a range of questions such as understanding response latencies, 
recovery times, effect durations, recursive dynamics, and so forth, can be examined in 
ways not possible with other data capture approaches. 


Challenges of Using Daily Methods in Health Psychology 


Although we laud the many benefits of using daily assessment methods in health psychol- 
ogy research, the challenges of using such techniques should be noted. As has been dis- 
cussed in other chapters of this volume, there are important study design considerations 
that researchers must take into account. Methods for studying daily life can be quite com- 
plex to implement, requiring that a number of important decisions be made in advance of 
starting a study. In that sense, the “upfront” or “startup” costs for daily methods studies 
can be high. For example, researchers must identify a suitable study design to address 
the research question, identify a sampling scheme (e.g., daily, within-day intensively), 
and consider questionnaire design. Studies using EMA methods can further require 
more intensive participant training and monitoring in the field for problems. Training 
and monitoring are especially important for studies involving electronic diaries or other 
participant-driven technologies. Participation in EMA studies can also require substan- 
tial time and effort on the part of the participants, often requiring them periodically to 
take time out of their day for study activities (e.g., complete assessments, collect biologi- 
cal samples). Researchers can encourage participant compliance with the study protocol 
by ensuring that participants are adequately trained (i.e., understand what they need to 
do), using materials (e.g., hardware, software) that allow for a user-friendly interface and 
provide cuing or reminders (e.g., alarms). In addition, creating a sense of accountability 
(e.g., electronic tracking) and rewarding cooperation can also help to improve compliance 
(see Hufford, 2007, for a review of compliance with EMA protocols). 

The potential challenges of using daily assessment methods described earlier may 
be complicated even further when implemented in health research. In large part this
can arise from functional or other limitations of the patient groups being studied. For example,
in planning a study using an EMA methodology for examining the relationship among 
symptoms, mood, and quality of life in patients undergoing chemotherapy for breast 
cancer, careful thought should be given to identifying the most appropriate sampling 
schedule. Although, from a research perspective, using an intensive sampling strategy 
(e.g., five to seven assessments per day) may be desirable, participant burden for this 
sample may be too high and thus unrealistic. In such cases, less burdensome approaches 
may be preferable (e.g., passive approaches such as observational or automated data
collection, or some variation on daily diary self-report approaches; see Gunthert & Wenze,
Chapter 8, this volume).

Practical issues in conducting studies with samples with acute and chronic illness can 
also pose challenges for daily assessment studies. When choosing hardware and software 
for such studies, sample-specific considerations can be an especially important deciding 
factor. For example, in a study of patients with rheumatoid arthritis, some of whom 
may have limited hand dexterity, available hardware options and potential adaptations 
(e.g., using a larger stylus or touch screen to respond on a palmtop computer) must be 
carefully considered. Similar challenges can also arise with computerized software in 
specific samples, requiring modifications or adaptations to the assessment protocol. For 
example, vision difficulties in older adults participating in a cell phone-based EMA study
of exercise habits may require researchers to identify software that supports larger fonts
for displaying questions on the cell phone (or even to consider alternative, more adaptive
technology to administer the study; e.g., voice recognition/recording to capture spoken
responses).

In citing these potential challenges, we are by no means suggesting that
daily assessment methods cannot, or should not, be used with specific populations or
patient groups. They do, however, serve to illustrate the ways that researchers applying 
these methods to address health-related questions need to consider carefully and adapt the 
assessment methodology so that it can be appropriately implemented (both technically, to 
improve feasibility and compliance, as well as responsibly, so as to not overburden study 
participants). Of note, when using patient groups or other highly selected samples, there 
is no simple “off the shelf” implementation strategy for daily (or within-day) assessment 
methodology. Great care must be taken to idiographically design and implement methods
unique to the research question and sample at hand.


What Types of Research Questions in Health Psychology 
Can Be Addressed with Daily Assessment Methods? 


In the following section, we provide an overview of methodological approaches to study- 
ing daily life that have examined various health psychology topics. More specifically, we 
highlight the use of daily assessment methods (1) in large-scale observational studies, (2) 
as a way to examine within-person differences across time, (3) to examine momentary 
associations and temporal sequencing, (4) to understand behavior patterns across individual
differences, (5) in psychophysiological assessment, and (6) to evaluate treatment
processes or outcomes. This is by no means an exhaustive review of all relevant studies; 
instead, our aim is to illustrate some of the unique ways daily assessment methods may 
be fruitfully applied in health psychology. 


Large-Scale Observational Studies 


Historically, it was not feasible to assess health-related factors or behaviors in a large 
number (hundreds or thousands) of people using daily assessment methods given practi- 
cal and logistical constraints. As technology advances, however, it is now possible to use 
daily assessment methods to collect information on people’s everyday health-enhancing 
(e.g., healthy eating, exercise) and health-compromising (e.g., smoking, sun exposure, 
sedentary activities) behaviors. Similar to retrospective assessment methods, these data 
can provide information about the frequency, duration, and relationships among various 
behaviors and experiences as they occur in daily life. 

By way of example, in a study of the social and physical contexts of physical activ- 
ity, 502 adolescents between the 9th and 12th grades (ages 14-18) recorded their activi- 
ties, social company, and location twice each year (Dunton, Whalen, Jamner, & Floro, 
2007). Using palmtop computers, participants completed the brief assessment every 30 
minutes for a 4-day period (Thursday to Sunday). This study provides another example 
of a way that daily assessment methods can be combined with more traditional repeated
measures, or longitudinal, designs (a design similar in many ways to the measurement
burst approaches described earlier in this chapter). Results suggested that greater propor- 
tions of physical activity occurred with friends, outside, and at school, with boys report- 
ing greater outdoor activity than girls. Across the 4-year assessment period, exercising 
with classmates, friends, and at school decreased. Activity at school was more frequent
on school days than on weekends, and physical activity outdoors was reported more 
frequently in the fall and spring months than in winter months. Overall these findings 
illustrated how adolescents’ gender and grade, as well as environmental characteristics 
(i.e., day of the week, season), are associated with physical activity level. Similar studies 
conducted in the United Kingdom have assessed patterns of physical activity and seden- 
tary behaviors in large samples (more than 1,300 participants) of adolescents (Gorely, 
Marshall, Biddle, & Cameron, 2007a, 2007b). The information gathered from these 
intensive studies of everyday activities and experiences can be useful for identifying the 
correlates of sedentary behavior, so that appropriate interventions can be developed. 

As mobile technology use becomes more widespread, large-scale observational studies
using daily assessment techniques are becoming a more realistic alternative to traditional
survey studies. For example, in a study of physical activity in middle-aged adults,
participants could use their mobile telephones to record their engagement in physical
activity throughout the day and send the data back to the researchers remotely via an
Internet connection. Objective measures of activity (e.g., through integrated actigraphy,
or creative use of the GPS capabilities of mobile phones) could also be collected and
transmitted remotely. Although such a study design would require considerably higher
“upfront” costs (financial, time, expertise) than traditional paper survey methods, the
resulting dataset would allow researchers to address more complex and nuanced
research questions.


Day-Level Variables


Daily diary methods involve once-daily assessments (usually near the end of the day) 
repeated over time, resulting in a relatively undemanding sampling approach. This assess- 
ment approach is discussed extensively elsewhere in this volume (see Gunthert & Wenze, 
Chapter 8); thus, we do not review it in detail here. However, many research questions in 
health psychology can benefit from techniques that collect information about daily varia- 
tions in events or experiences over time. For example, in a smoking cessation study, Van 
Zundert, Ferguson, Shiffman, and Engels (2006) asked people who were attempting to 
quit smoking to rate their self-efficacy for quitting and any lapses or relapses each day for 
1 week before and 3 weeks after a quit attempt. They found that lower self-efficacy on a 
given day predicted lapses and relapses on the subsequent day. This study design allowed 
the researchers to capture information and characterize the “type” of day that people 
were having (e.g., a low self-efficacy day, a lapse day) and use this to predict events on 
subsequent days. Such a design provides information about the more dynamic relation- 
ships between health variables as they occur across days in the real world. 
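
A day-level lagged analysis of this kind depends on pairing each day's predictor with the
same person's outcome on the following day. The Python sketch below shows one way such
pairs might be built. The record layout, the function name next_day_pairs, and the example
values are illustrative assumptions and are not taken from the cited study.

```python
from collections import defaultdict


def next_day_pairs(records):
    """Pair each day's predictor with the same person's outcome on the next day.

    records: iterable of (person_id, day_number, self_efficacy, lapsed) tuples.
    Returns a list of (person_id, day_number, efficacy_today, lapsed_tomorrow).
    """
    by_person = defaultdict(dict)
    for person, day, efficacy, lapsed in records:
        by_person[person][day] = (efficacy, lapsed)

    pairs = []
    for person, days in by_person.items():
        for day in sorted(days):
            # Only pair consecutive days; a missed diary entry breaks the chain.
            if day + 1 in days:
                efficacy_today = days[day][0]
                lapsed_tomorrow = days[day + 1][1]
                pairs.append((person, day, efficacy_today, lapsed_tomorrow))
    return pairs


# Hypothetical diary entries: (person, day, self-efficacy rating, lapse that day?)
diary = [
    ("p1", 1, 4, False),
    ("p1", 2, 2, False),
    ("p1", 3, 3, True),
    ("p1", 5, 4, False),  # day 4 is missing, so day 3 has no next-day pair
    ("p2", 1, 1, False),
    ("p2", 2, 3, True),
]
pairs = next_day_pairs(diary)
```

Lags are formed within each person and only across consecutive days, so a missed diary
entry does not silently turn a 2-day gap into a "next-day" effect. The resulting pairs
would then typically be analyzed with a multilevel model, which is not shown here.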


Momentary Associations and Temporal Sequencing 


As has already been demonstrated in prior examples, intensive sampling strategies that 
collect multiple assessments (across days or throughout a day) can allow researchers to 
begin to answer questions about the momentary associations between variables. Whether
a study design uses relatively few assessment points
(e.g., three self-report questionnaires about eating habits for 1 week) or more intensive 
self-report, physiological measurement, or automatic recording of events or behavior 
(e.g., heart rate measured using an ambulatory monitor every 15 seconds for 4 days), 
such data can address questions regarding how various health behaviors or symptoms 
are associated with one another within individuals in real time. For example, Dunton, 
Atienza, Castro, and King (2009) used EMA to identify the social, cognitive, and affec- 
tive factors that influence physical activity in middle-aged adults. Participants recorded 
their moderate-to-vigorous physical activity, affect, self-efficacy, control, fatigue, energy, 
and social interactions on a palmtop computer in response to four random prompts 
each day for 2 weeks. They found that at times (moments) when people reported a 
positive social interaction, they were more likely also to report engaging in moderate- 
to-vigorous physical activity. If physiological data are collected concurrently with self- 
reports, temporal linkages between self-reported data and ambulatory physiology can 
also be explored. For example, reported stress can be linked to increases in ambulatory 
physiology (cortisol, blood pressure, etc.; Smyth et al., 1998; Vella, Kamarck, & Shiff- 
man, 2008). These possibilities are discussed in more detail below and elsewhere in this 
book (see Part II). 
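
A signal-contingent protocol with several random prompts per day requires a schedule of
prompt times. One common approach, sketched below in Python, is to divide the waking day
into equal blocks and draw one random time within each block, so that prompts are
unpredictable yet spread across the day. The 9:00 to 21:00 window, the blocking scheme,
and the function name are assumptions made for illustration; the studies cited above do
not necessarily schedule prompts this way.

```python
import random


def stratified_prompt_times(start_hour, end_hour, n_prompts, rng):
    """Draw one random prompt time (minutes after midnight) per equal-length block."""
    start, end = start_hour * 60, end_hour * 60
    block = (end - start) / n_prompts
    times = []
    for i in range(n_prompts):
        block_start = start + i * block
        # rng.random() is in [0, 1), so each time stays inside its own block.
        times.append(int(block_start + rng.random() * block))
    return times


rng = random.Random(2024)  # fixed seed only so the sketch is reproducible
schedule = stratified_prompt_times(9, 21, 4, rng)  # four prompts, 9:00 to 21:00
print([f"{t // 60:02d}:{t % 60:02d}" for t in schedule])
```

A fully random draw over the whole day is simpler but can cluster prompts within a short
span; blocking trades a little randomness for more even coverage of the day.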

An important benefit of using assessment methods that collect multiple daily mea- 
surements is that the temporal relationships among variables can be explored. Such data 
allow researchers to address more complex and nuanced research and clinical health 
questions about dynamic associations and processes that occur over time. Data allow- 
ing for temporal sequencing analyses can be collected in several ways. Research areas in 
which discrete events or behaviors of interest can be identified (e.g., drug craving, binge 
episode, exercise episode) are potentially well suited for such research designs.
Event-contingent sampling methods (discussed in detail by Moskowitz & Sadikaj, Chapter 9,
this volume) can then be used alone or combined with daily diary methods to collect 
information on the social, emotional, psychological, or physical experiences of people 
leading up to, or as consequences of, the events or behaviors of interest. For example, in 
a study of patients with bulimia nervosa, the affective precursors and consequences of 
bingeing and vomiting episodes were examined with an EMA research design (Smyth et 
al., 2007). Participants used a palmtop computer to record bingeing and vomiting epi- 
sodes. They were also prompted six times daily to complete assessments of mood, and 
were again asked about any bingeing/vomiting episodes they may not have spontaneously 
reported since the previous assessment. Results showed that in the hours leading up to a 
bingeing or vomiting episode, women reported decreasing positive affect and increasing 
negative affect and anger/hostility. Conversely, in the hours after the event, positive affect 
steadily increased and negative affect and anger/hostility decreased (Smyth et al., 2007). 
Similar designs (i.e., including both event- and signal-contingent assessment
protocols) have been used to examine mood and contextual factors in real time that lead 
up to cocaine and heroin craving and use (Epstein et al., 2009; Preston et al., 2009) and 
smoking cessation lapses (Stone et al., 2005; Van Zundert et al., 2006). 

In addition to identifying precursors and sequelae of discrete events or behaviors, 
multiple daily assessments can examine changes from one assessment to the next (i.e., 
over the course of hours). For example, Dunton and colleagues (2009) used EMA to iden- 
tify both concurrent and prospective factors that influence physical activity in middle- 
aged adults. The authors found not only that social interactions were associated with con- 
current physical activity but also that greater self-efficacy and control at one assessment 
prospectively predicted moderate-to-vigorous physical activity at the next assessment 
several hours later. Momentary data collection, whether organized around a discrete 
event or using a signal-contingent protocol, can address complex, time-dependent ques- 
tions about the dynamic relationships among variables. Moreover, if the data capture is 
sufficiently “dense” (i.e., has a sufficient number of observations), it becomes possible to 
model change, relationships between variables, and so forth, in a dynamic fashion (allow- 
ing estimates of nonlinear processes, characterization of variability, recursive and other 
feedback effects, etc.). 
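
Analyses of precursors and consequences of a discrete event usually begin by re-expressing
each assessment's time relative to the event. The Python sketch below shows that alignment
step and a simple least-squares slope for the ratings before and after the event. The data
values and function names are hypothetical, and the single-day slope stands in for the
multilevel models that published studies would normally use.

```python
def hours_from_event(assessments, event_time):
    """Split ratings into those before and after a recorded event.

    assessments: list of (time_in_hours, rating); event_time: hour of the event.
    Returns two lists of (hours_relative_to_event, rating), each sorted by time.
    """
    before, after = [], []
    for t, rating in sorted(assessments):
        offset = t - event_time
        if offset < 0:
            before.append((offset, rating))
        elif offset > 0:
            after.append((offset, rating))
    return before, after


def slope(points):
    """Ordinary least-squares slope of rating on time offset."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Hypothetical negative-affect ratings for one day; event reported at hour 15.
ratings = [(9, 2.0), (11, 2.5), (13, 3.5), (14.5, 4.0), (16, 3.0), (18, 2.5), (20, 2.0)]
before, after = hours_from_event(ratings, event_time=15)
pre_slope, post_slope = slope(before), slope(after)
```

In this made-up example the pre-event slope is positive and the post-event slope is
negative, the same qualitative pattern described above for negative affect around
bingeing or vomiting episodes.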


Individual Differences and Studying Daily Life 


Health psychology researchers are often interested in understanding the ways in which 
person-level characteristics or individual differences influence health-related behaviors 
or illness processes. Momentary and daily assessment methods can be, and are, quite 
commonly incorporated into the study of person-level factors. For example, one might 
be interested in how an individual-difference characteristic (e.g., neuroticism) or demo- 
graphic variable (e.g., age) influences smoking behavior assessed in everyday life. In an 
EMA study of adults with asthma or rheumatoid arthritis, Juth, Smyth, and Santuzzi 
(2008) examined the role of self-esteem in reports of mood, stress, and physical symptoms
assessed in everyday life. Patients completed an initial (paper) assessment of self-esteem,
then completed five daily assessments of mood, stress, and disease symptoms
(coughing and wheezing for asthma patients, pain for rheumatoid arthritis patients) on a 
palmtop computer for 1 week. They found that patients with lower self-esteem reported 
more negative affect and less positive affect, rated stressors as more severe, and had more
severe physical symptoms than patients with higher self-esteem. This methodology of 
combining assessment methods can provide insight into how stable person-level factors 
might influence the (within-person) daily experiences of disease processes (or behaviors, 
social interactions, etc.). 

In many cases individual differences are also thought to moderate relationships 
among variables, including those that may be assessed in daily life. As has been dis- 
cussed, daily and momentary assessment methods are particularly useful for examining 
dynamic relationships among variables in real time. These relationships, however, can 
also be moderated by relatively stable person-level factors (i.e., individual differences). 
For example, Vella and colleagues (2008) examined the role of hostility in moderating 
the relationship between social interactions and ambulatory blood pressure (ABP). Par- 
ticipants first completed a measure of hostility, then had their blood pressure monitored 
every 45 minutes for 6 days with an ABP monitor. At the time of the ABP reading, they 
also used an electronic diary to report on their social interactions and mood. Results indi- 
cated that individuals with low hostility experienced reductions in ambulatory diastolic 
blood pressure during social interactions rated as more intimate, whereas this effect was 
not observed in more hostile participants. Conversely, high-hostility people experienced 
an increase in diastolic blood pressure during situations rated high in social support. 
These findings suggest that hostile people may experience (and navigate) social interac- 
tions differently from people low in hostility, and that such differences have real-time 
physiological implications. 

Researchers are increasingly interested in how individual differences in genetic 
makeup may be associated with health-related behaviors and processes (see Way & Gur- 
baxani, 2008). In particular, genetic variables (generally measured as fixed person-level 
variables) could modulate relationships between variables assessed using momentary and/ 
or daily assessment methods. In an excellent example of the potential of such approaches, 
Gunthert and colleagues (2007) assessed the serotonin transporter gene polymorphism 
(5-HTTLPR) and anxiety reactivity in daily life. College student participants (N = 350)
completed one Web-based daily assessment of stressors and mood for two 30-day peri- 
ods, separated by 1 year. Participants in the study also provided a DNA sample, which 
was used to identify individuals with at least one copy of the short or long allele of the 
5-HTTLPR. On days when people reported experiencing more severe stress, S and LG
allele carriers reported greater anxiety compared with people homozygous for the LA
allele. These findings suggest that allelic variation in 5-HTTLPR acts as a genetic
diathesis for experiencing anxiety in response to daily stressors. More broadly, these examples
illustrate the importance of considering how person-level variables (notably, genetic vari- 
ables, but also including other stable individual differences) may moderate dynamic rela- 
tionships that are assessed in everyday life. This study also provides another example of 
a variant on the measurement burst design, in which daily assessment methods (i.e., daily 
assessment for 30 days) are coupled with longitudinal assessment (repeated at 1 year). 
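
Testing whether a stable person-level variable moderates a daily relationship generally
requires separating each person's typical level of the daily predictor from day-to-day
departures from that level. The Python sketch below illustrates this person-mean centering
step. The function name and the stress ratings are hypothetical; the moderation test
itself (a cross-level interaction in a multilevel model) is not shown.

```python
from statistics import mean


def center_within_person(values_by_person):
    """Return each person's mean and their observations as deviations from it."""
    person_means = {p: mean(v) for p, v in values_by_person.items()}
    deviations = {
        p: [x - person_means[p] for x in v] for p, v in values_by_person.items()
    }
    return person_means, deviations


# Hypothetical daily stress ratings for two participants.
stress = {"p1": [1, 2, 3], "p2": [4, 4, 7]}
person_means, deviations = center_within_person(stress)
```

The person means carry between-person differences (who is generally more stressed),
whereas the deviations carry within-person variation (which days are more stressful than
usual for that person). A person-level moderator such as genotype or hostility would then
be allowed to alter the slope associated with the deviations.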


Psychophysiological Assessment in Health Psychology 


Another promising aspect of technology-based ambulatory assessment involves the collec- 
tion of real-time physiological data. A variety of measures (e.g., skin conductance, heart 
rate, blood pressure, respiration, blood glucose, and body motion) can be collected using 
small electronic devices that can often be worn discreetly on the body without detection 
or substantive interference with daily activities (Ebner-Priemer & Kubiak, 2007; Morris,
Digital Health Group, & Intel Corporation, 2007; Patrick, Intille, & Zabinski, 2005). 
Monitoring devices can take many forms, such as blood pressure cuffs, watches, rings, or 
pendants to monitor physical movements, and wireless chest straps or clothing designed 
to monitor heart rate, respiration, and/or blood glucose. These ambulatory physiological 
tools generally allow continuous monitoring, without requiring the user to do anything. 
These assessment methods have been covered extensively in other sections of this volume 
(see Schlotz, Chapter 11; F. Wilhelm, Grossman, & Müller, Chapter 12; Bussmann & 
Ebner-Priemer, Chapter 13), and the practical aspects of such measurement are not dis- 
cussed in depth here. This current and emerging technology, however, has many impor- 
tant applications for real-time assessment in health psychology. 

Given the well-established role of psychological, neuroendocrine, and physiological 
processes in contributing to illness (e.g., Miller, Chen, & Cole, 2009), ambulatory psy- 
chophysiological monitoring is increasingly being utilized in health psychology research. 
Research examining the association between momentary self-report data (experiences, 
events, mood, etc.) and ambulatory psychoneuroendocrinology (e.g., cortisol, glucose), 
physiology (e.g., blood pressure, heart rate, electrodermal activity), and activity (e.g., 
motion, posture) monitoring is now common in the field. For instance, several stud- 
ies have demonstrated the momentary association between subjective ratings of stres- 
sor occurrence and negative mood, and the real-time influence on salivary cortisol (e.g., 
Jacobs et al., 2007; Smyth et al., 1998). In these studies, participants are typically signaled 
several times each day with a programmed electronic device (e.g., wristwatch, beeper, 
palmtop computer), and asked to complete measures of stress and mood, and provide a 
saliva sample using a cotton swab and Salivette. In general, these studies find that stress 
and negative mood are associated with higher cortisol levels (and, although less strongly, 
positive mood with lower cortisol). Such ambulatory studies, whether examining psy- 
choneuroendocrine function or other physiological systems, provide evidence of putative 
mechanisms whereby stress (and other psychosocial characteristics) can “get under the 
skin” and affect physiological processes that in turn are known to impact important 
disease processes (e.g., Cohen, Janicki-Deverts, & Miller, 2007; Miller, Chen, & Cole, 
2009). Such evidence is a necessary and important step in building a more thorough 
framework that links daily experiences to health and disease. 


Evaluating Interventions 


Daily assessment methods can be used as a tool to evaluate psychosocial and health 
behavior interventions and/or to understand potential mechanisms of change. The
intensive assessment schedule inherent in these methods provides an opportunity
to collect a plethora of information before, during, and after clinical interventions. Such 
data provide great temporal resolution to the changes in outcome as a result of the inter- 
vention, allowing for earlier detection of change and improved ability to identify the 
nature of the change (gradual, stepwise, etc.). This was shown to be the case in a study 
of antidepressant treatment response; standard weekly clinic assessments were compared 
to daily diary assessments to determine whether it was possible to detect antidepressant 
response more quickly with daily assessment compared to standard weekly assessment 
(Lenderking et al., 2008). Participants taking fluoxetine were randomized to standard 
weekly clinic assessments or standard and daily assessments, and all participants were 
followed for 28 days. Results suggested treatment effects could be detected sooner using
daily diaries compared to standard weekly clinic assessments. Furthermore, neither ratings
of perceived burden nor dropout rates suggested that completing daily diaries was
ings of perceived burden nor dropout rates suggested that completing daily diaries was 
significantly burdensome to patients (Lenderking et al., 2008). 

Using repeated daily assessments of variables of interest and potential mechanisms 
of change prior to, during, and following the intervention, researchers can more care- 
fully observe and track the change process as it occurs. This approach can help to illu- 
minate and test hypothesized mechanisms of change. For example, EMA has been used 
to assess patients in a smoking cessation intervention in an effort to understand better 
the mechanism whereby the treatment has an effect (Ferguson, Shiffman, & Gwaltney, 
2006; McCarthy et al., 2008). In one study (Ferguson et al., 2006), smokers were ran- 
domized to receive either nicotine replacement therapy (NRT; high-dose nicotine patch) 
or a placebo. Using palmtop computers, they reported real-time craving and withdrawal 
symptoms. The goal of this study was to test the widely accepted hypothesis that NRT 
reduces nicotine craving and withdrawal, thus improving smoking cessation outcome. 
Results showed that NRT did decrease withdrawal and craving symptom severity, but 
this reduction only partially accounted for NRT’s impact on the time to first lapse. These 
findings suggest that although NRT may reduce nicotine cravings and withdrawal in real 
time, other mechanisms for the effectiveness of NRT in smoking cessation interventions 
should also be explored. Stone, Smyth, Kaell, and Hurewitz (2000) have also used EMA 
methods to examine possible mediators of treatment outcome for an expressive writing 
intervention (i.e., structured writing exercise about stressful events) for patients with rheu- 
matoid arthritis or asthma. Participants monitored perceived stress, sleep quality, affect, 
substance use, and medication use for 1 week before, during, and 2 weeks following
the expressive writing intervention. There was no evidence that short-term alterations in 
these variables mediated treatment effects, suggesting that additional research examining 
the mechanism of change for expressive writing interventions is needed. Overall, these 
studies illustrate how daily assessment methods could be useful for identifying potential 
mechanisms of change for psychosocial and pharmacological treatment. 


Ecological Momentary Intervention Methods 
in Health Psychology 


Although our purpose in this chapter is to discuss the use of daily assessment methods 
in health psychology, the integration of real-time assessment and intervention for health 
behavior change is becoming increasingly popular and warrants further discussion.
So-called ecological momentary interventions (EMI) are characterized by the delivery of
interventions to people as they go about their daily lives (Heron & Smyth, 2010; Patrick 
et al., 2005). Similar to many daily assessment methods, EMI occur in the natural envi- 
ronment (thus providing a more ecologically valid intervention), and are delivered at spe- 
cific moments in everyday life. As has been discussed throughout this volume, advances in 
mobile technology have facilitated the use of daily assessment and intervention methods. 
To date, mobile technology (i.e., palmtop computers, mobile telephones) has been used 
to deliver interventions in patients’ everyday lives for a variety of conditions and health 
behaviors, including healthy eating and physical activity, weight loss, smoking cessation, 
and alcohol use (for a review, see Heron & Smyth, 2010). 

EMI studies implement interventions in a variety of different ways; however, they
predominantly provide EMI as only one component of the treatment. The treatment pro- 
tocol generally consisted of EMI provided in addition to cognitive-behavioral therapy 
(CBT) or interactive website interventions. For example, in studies using CBT and EMI 
for weight loss, palmtop computer-based treatment modules allowed participants to 
track caloric intake and physical activity, facilitated goal setting, encouraged the use of 
behavioral weight loss strategies discussed in the treatment groups, and provided some 
limited individual feedback. There is preliminary evidence that such EMI methods may 
be an efficacious means of providing weight loss treatment to overweight women (Agras, 
Taylor, Feldman, Losch, & Burnett, 1990; Burnett, Taylor, & Agras, 1985, 1992). 

As cellular telephone use becomes more widespread, the opportunity for using this 
technology to deliver health behavior EMI to hundreds or thousands of people is becom- 
ing a reality. For instance, Rodgers and colleagues (2005) tested the efficacy of a smoking 
cessation intervention delivered via mobile phone-based text messaging to a large sample 
(N = 1,705) of smokers in New Zealand. Participants received five daily text messages 
around their quit day and continued to receive three messages per week for the following 
6 months. Participants also had access to a website and could send text messages to a 
designated number and immediately receive tips for coping with cravings. People receiv- 
ing the EMI were significantly more likely to quit smoking than control participants 6 
weeks (28 vs. 13%) and 12 weeks (29 vs. 19%) after beginning the study, but not at 26 
weeks (25 vs. 24%). Two other studies showed similar short-term treatment gains during 
a smoking cessation intervention (Brendryen & Kraft, 2007; Obermayer, Riley, Asif, & 
Jean-Mary, 2004). Interventions such as these that rely on automated systems (e.g., those 
administered via websites and using mobile technology) have the potential to reach a 
large number of people without requiring a great deal of resources (clinician or researcher 
time, equipment, etc.), and are thus of great interest from the perspectives of efficiency, 
cost-effectiveness, and reach. 

An area of increasing interest in daily assessment and intervention for health behav- 
ior treatments is the development of individually content- and time-tailored interventions. 
Although tailoring the content of interventions to participant characteristics is fairly 
commonplace, EMI protocols allow for timing of the delivery of EMI to be individually 
tailored as well (this also permits person-by-moment tailoring, that is, providing
targeted information to specific people differently as a function of momentary
characteristics, although this has been less well studied to date). For example, in a
smoking cessation program, EMI can be
delivered as text messages to participants’ cell phones at specific times when they typi- 
cally smoke (Lazev, Vidrine, Arduino, & Gritz, 2004; Obermayer et al., 2004; Rodgers 
et al., 2005; Vidrine, Arduino, & Gritz, 2006). EMI can also be based on people’s recent 
or concurrent EMA reports. A few previous studies requested that participants periodi- 
cally complete EMA regarding their current affective state (e.g., anxiety level), or behav- 
iors (e.g., calories consumed), and provided EMI only at times when they were reporting 
problematic behaviors or experiences (Agras et al., 1990; Burnett et al., 1985, 1992; 
Newman, Consoli, & Taylor, 1999). Several studies have demonstrated that designing 
a treatment protocol that automatically delivers a tailored EMI based on concurrent or 
recent EMA is feasible. To date, however, research testing whether such delivery systems 
are more efficacious than traditional formats is sparse, and this remains a priority area 
for future research. 
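
The logic of delivering an EMI on the basis of a recent EMA report can be expressed as a
simple decision rule. The Python sketch below combines two of the tailoring strategies
described above: delivery at times when a person typically smokes, and delivery when a
momentary report crosses a threshold. The craving scale, the threshold, the hours, and
the function name are placeholders chosen for illustration and do not come from any of
the cited studies.

```python
def should_deliver_emi(report, craving_threshold=7, usual_smoking_hours=(8, 13, 20)):
    """Decide whether to send an intervention message after an EMA report.

    report: dict with 'hour' (0-23) and 'craving' (0-10 rating).
    The threshold and hours are illustrative placeholders, not validated values.
    """
    high_craving = report["craving"] >= craving_threshold
    habitual_time = report["hour"] in usual_smoking_hours
    return high_craving or habitual_time


# Hypothetical momentary reports.
print(should_deliver_emi({"hour": 10, "craving": 8}))  # high craving
print(should_deliver_emi({"hour": 13, "craving": 2}))  # habitual smoking time
print(should_deliver_emi({"hour": 10, "craving": 3}))  # neither condition met
```

In practice the habitual hours would be estimated separately for each participant, for
example from a baseline monitoring period, which is what makes the rule person-by-moment
rather than merely time based.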

Time-tailored EMI treatment protocols appear to be especially useful for conditions
or behaviors with distinct antecedent states (e.g., cravings/urges, experience of
negative emotions) or events (e.g., stressors, mealtimes, particular social situations) for 
which EMA assessments and interventions can be developed. As ambulatory physiologi- 
cal assessment devices (e.g., heart rate, blood pressure; Asada, Shaltis, Reisner, Rhee, & 
Hutchinson, 2003; Ebner-Priemer & Kubiak, 2007), and environmental sensors (e.g., 
location, presence of others; Gandy, Starner, Auxier, & Ashbrook, 2000; Patrick et al., 
2005) become more sophisticated, people’s momentary physiological states and environ- 
mental conditions will also be used to trigger and to tailor real-time interventions. For 
example, Morris and colleagues (2007) are developing technology to monitor health sta- 
tus in everyday contexts and translate it into personalized feedback delivered to patients 
in real time. Although these systems have yet to be evaluated, there is great potential to 
create interactive and dynamic treatments for use in everyday life that are sensitive to 
internal and external cues. 


Summary and Conclusion 


We hope that this brief summary captures the potential benefits of daily and within- 
day approaches to research in health psychology. Many of the general benefits of daily 
and within-day methods, including EMA, certainly apply to health psychology as well. 
These include, but are not limited to, data capture benefits, such as the reduction of recall 
and reconstruction biases, as well as greatly enhanced ecological validity. Moreover, as 
technology advances, the range of research and clinical possibilities is expanding rapidly 
as well. The use of improved handheld computing and cellular technology (to capture 
and store different types of data, to record and transmit video, etc.) is opening up new 
possibilities. Integrated assessment systems that capture other data streams in real time 
(physiology and biomarkers, physical activity and location, social behaviors and proxim- 
ity, etc.) will allow researchers to address increasingly diverse and theoretically rich ques- 
tions. Such approaches will allow researchers to build models across both relatively time- 
invariant (e.g., genetic) and time-varying (e.g., affective, social) processes, and across 
multiple levels of assessment and analysis (within- and between-person, short- and long- 
term temporal processes, nested in different social and cultural contexts, etc.). There is 
also the very exciting potential for the linkages of ambulatory data capture, particularly 
in real time, with intervention research. We believe that this work will allow researchers 
and clinicians to administer “psychologically personalized medicine” tailored not only to 
(stable) person characteristics but also to moments of dynamically assessed risk (psycho- 
logical, behavioral, physiologic) within persons over time. Overall, the multidisciplinary 
and integrative nature of health psychology research is perfectly poised to leverage the 
many advantages of research methods that allow the detailed and dynamic study of the 
person in time and in context. 
CHAPTER 33


Developmental Psychology 


JOEL M. HEKTNER 


The development of persons is not often perceptible from one moment to the next, or
even from one day to the next. Thus, the pillars of developmental science are longitudinal
studies spanning years. Nevertheless, micro-longitudinal studies of daily life that
produce intensive longitudinal data have been invaluable to the study of development 
across most of the lifespan. Beginning with Csikszentmihalyi and Larson’s (1984) land- 
mark study of the daily life of adolescents, developmental psychologists have used the 
experience sampling method (ESM) and its variants to capture the contexts of develop- 
ment, and the relations between those contexts and the behavior and inner psychological 
life of people. 

Developmental researchers have always taken context seriously, but they have also 
striven for methodological rigor, a frequent casualty when moving out of the laboratory. 
There is also a growing recognition of and interest in variations within people across time 
and situations. People do not all develop at the same rate or in the same direction, and 
some may have episodes of regressions followed by spurts of growth. Only research meth- 
ods designed for studying daily life combine the “real-world,” “real-time,” and “within- 
person” perspectives in a methodologically rigorous way to satisfy many of the needs of 
the field of developmental science. The growing importance of such studies to the field 
is evident from their ever-increasing number and wider range (for a recent review, see 
Hoppmann & Riediger, 2009). 

These studies of daily life have been conducted with people ranging from childhood 
to old age. In the sections that follow, periods of the lifespan provide an organizational 
structure to a targeted review of the relevant literature. This is done, not to imply that the 
developmental questions and issues are different in each period—indeed, several themes 
recur throughout the lifespan—but because nearly all of this research is confined to a 
single developmental period, and most developmental psychologists still specialize in just 
one part of the lifespan. 


Studies of Children 


Because of the obvious methodological obstacles, studies of the daily life of infants, tod- 
dlers, and young children that attempt to capture their momentary inner experience do 
not exist. However, researchers have made strides in capturing some elements of the daily
life of very young children. For example, Warren and colleagues (2010) used a vocal 
analysis system called LENA (language environment analysis) to record an entire day of 
vocal interactions between toddlers and their caregivers. The children wore the record- 
ing device in a chest pocket of clothing specifically designed to hold it. Although adults 
interacting with children with autism spectrum disorder (ASD) spoke to them just as 
frequently as adults interacting with typically developing children, the children with ASD 
had fewer conversational turns and engaged in more monologues. The LENA system 
appears to be a valuable tool for capturing both a child’s own vocalizations and the oral 
language in the environment. 

Age 7, when children are old enough to attend school and can nominally read and 
write, appears to be the youngest age at which researchers have used methods that cap- 
ture inner experience in daily life. Manning (1990) asked elementary school students ages 
7-11 to record their self-talk in a journal two times each morning during independent 
seatwork over a period of 20 school days in response to a timer set randomly by their 
classroom teacher. Their open-ended responses were coded along three different dimen- 
sions. Children’s behavior, as rated by their teacher on a positive-negative scale, was cor- 
related with the positivity of their self-talk. Children with higher IQs and higher school 
achievement recorded more neutral self-talk, whereas those with lower IQs had more 
negative self-talk statements. Although this study was narrowly focused on the classroom 
context, the method enabled examination of links among momentary inner experience, 
behavior, personal characteristics, and longer-term academic outcomes. 
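
A signal-contingent schedule of the kind described above, with two randomly timed prompts per morning, could be generated along the following lines. This is only a sketch: the 9:00-12:00 window and the function name are assumptions made for illustration, since the source does not say how the teachers' timers were set.

```python
import random
from datetime import date, datetime, timedelta
from typing import List, Optional


def morning_signal_times(day: date, n_signals: int = 2,
                         start_hour: int = 9, end_hour: int = 12,
                         rng: Optional[random.Random] = None) -> List[datetime]:
    """Draw n_signals distinct random minutes within one morning window."""
    rng = rng or random.Random()
    window_minutes = (end_hour - start_hour) * 60
    # Sampling without replacement guarantees the prompts never coincide.
    offsets = sorted(rng.sample(range(window_minutes), n_signals))
    window_start = datetime(day.year, day.month, day.day, start_hour)
    return [window_start + timedelta(minutes=m) for m in offsets]
```

Calling the function once for each of the 20 school days would yield a full schedule; passing a seeded `random.Random` makes the schedule reproducible.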

Negative self-talk (i.e., self-criticism), along with greater dependency, was found in 
another study to be related to higher levels of depressive symptoms following negative 
events. Adams, Abela, Auerbach, and Skitch (2009) used handheld computers to admin- 
ister two questionnaires to children ages 7-14 once a week for 6 weeks. The computer 
read the questions aloud to the children while highlighting the words on the screen, and 
the participants responded by using a stylus to tap the screen. The weekly data collection 
schedule in this study means that it is not strictly focused on daily life as it was experi- 
enced, as some filtering and reconstructing of experience must inevitably have occurred 
before each sampling. Nevertheless, the multiple waves close in time did yield valuable 
data that illuminated intraindividual variation over a short time scale. 

These studies of children in middle childhood, though important, still do not col- 
lectively provide a comprehensive picture of the daily life of people at this life stage. The 
studies have proven that it is possible to collect intensive longitudinal data, including 
self-reports, from children as young as 7. Their focus on the mental health and emo- 
tional life of children is a good first step, but future studies will need to do much more 
to analyze the role of contextual factors in within-person variation. With the support 
of schoolteachers and parents, and with the use of new technologies for both delivering 
questions and collecting responses, it should be possible to design and implement a study 
that would enable a wide-ranging examination of factors in children’s daily lives that 
affect their development. 


Studies of Adolescents 


There are several good examples of such comprehensive developmental studies examining 
the second decade of life. Adolescents have been the focus of a plethora of ESM studies.
Indeed, because they have the capacity for self-reflection yet are still easily accessible in
large numbers via schools, adolescents have been ideal participants in studies of daily 
life. The dynamic variation in their emotions, the great range of contexts and activities 
in which they engage, and the swift pace of their developmental changes also make them 
very interesting subjects of study. 


Use of Time 


Most adolescents around the world go to school, but there is wide variation in the amount 
of time they actually spend on schoolwork. In a review of both ESM and non-ESM studies, 
Larson and Verma (1999) reported that Korean students spend about half of their wak- 
ing hours on schoolwork, Japanese about one-third, and Americans about one-fourth. 
Highly talented adolescents in one U.S. study spent more time engaged in classwork while 
in school, compared to average adolescents, but not outside of school, perhaps showing 
that they used their school time more efficiently (Csikszentmihalyi, Rathunde, & Whalen, 
1993). When not in school, adolescents have reported engaging in some household labor, 
with girls typically doing more than boys, and those from developing countries doing 
much more than those from developed nations (Larson & Verma, 1999). Boys spent more 
time engaging in informal sporting activities, but with age, both boys and girls decreased 
their participation in informal sports (Kirshnit, Ham, & Richards, 1989). 

Besides culture and gender, the socioeconomic class of an adolescent’s community 
also influences how time is spent. Shernoff and Csikszentmihalyi (2001) found that ado- 
lescents from higher-class communities spent more time in school and in public, and less 
time at home than those from lower-class communities. The higher-class denizens also 
spent more time engaged in hobbies and extracurricular activities, whereas lower-class 
teens spent more time in passive leisure activities, such as watching television. 


Companionship and Solitude 


Not surprisingly, as adolescents age they spend less time with family. Larson and Rich- 
ards (1991) documented a 40% drop in family time for adolescents between fifth and 
ninth grades. The decline continued through 12th grade, when they spent 60% less time 
with family than fifth graders (Larson, Richards, Moneta, Holmbeck, & Duckett, 1996). 
On the other hand, time spent one-on-one with either parent, though rare, did not decline 
with age. Time in family interactions and activities was replaced by solitude and, for girls, 
more time spent with friends. Although time alone is not usually accompanied by positive 
affect, there may be developmental advantages to spending a moderate amount of time 
alone (Larson, 1990). Talented adolescents spent more time alone and more time with 
their parents than average adolescents (Csikszentmihalyi et al., 1993). 


Emotional Experience 


Although the preceding descriptive information on the activities and social contexts of 
adolescents is important for understanding their development, most ESM research with 
adolescents has gone beyond this level to examine the relation between these external 
factors of daily life and psychosocial developmental processes. One basic developmental 
question is whether and how the experience of emotions changes with age. Compared to 
both preteens and adults, adolescents experience a more extreme range of positive and 
negative emotions, and fluctuate between these extremes more rapidly (Larson, Moneta,
Richards, & Wilson, 2002; Larson & Richards, 1994a; Verma & Larson, 1999). There is 
also some converging evidence that early adolescents experience an overall decline in the 
positivity of their emotions, which flattens out by late adolescence (Larson et al., 2002; 
Moneta, Schneider, & Csikszentmihalyi, 2001). Larson and colleagues (2002) showed 
that daily fluctuation in emotions slowed between early and late adolescence, indicating 
an increase in emotional stability with age. 

Interestingly, Moneta and colleagues (2001) found that despite a linear increase in 
global self-esteem throughout adolescence, as measured by questionnaire, momentary 
self-worth dipped throughout early adolescence to a low point around grade 10. This 
difference between momentary and global reports was also documented by Freeman, 
Csikszentmihalyi, and Larson (1986), who noted strong stability in affective states over 
2 years in the ESM reports of high school students, despite the students’ recollections of 
substantial positive changes. These findings are not interpreted as indicating the inaccu- 
racy of global recall, but rather as evidence of parallel developmental processes operating 
at different time scales. 

The growth in negative emotional experiences of early adolescents may in part be 
related to changes in their perceptions of events. Larson and Ham (1993) found that 
early adolescents reported more negative events with their peers, school, and family than 
did preteens, and that the association between negative events and negative affect was 
stronger for the adolescents. The sources of stress experienced by preteens were concrete, 
immediate activities, whereas early adolescents experienced stress from more abstract, 
social sources, including others’ feelings and anticipation of future social events (Larson 
& Asmussen, 1991). 

Another factor affecting these developmental changes in emotional experience in 
early adolescence is the set of biological changes associated with puberty. Richards and 
Larson (1993) showed that with greater pubertal development, boys experienced higher 
levels of tension, feeling in love, positive affect, attention, and strength. Puberty also 
increased feelings of being in love for girls, but otherwise had few or weak associations 
with mood levels and variability. To study changes in reward-related behavior, Forbes 
and colleagues (2010) linked an ESM study with functional magnetic resonance imaging 
(fMRI). They found that puberty appears to alter the functioning of two specific brain 
regions with respect to reacting to rewards, in ways that also negatively affect emotional 
experience. The authors suggest that adolescents may compensate for these changes by 
increasing their reward-seeking behavior at this developmental point. 


Flow and Intrinsic Motivation 


One of the original areas of focus of ESM research on adolescents was flow, the state of 
experience characterized by high levels of concentration, enjoyment, and intrinsic moti- 
vation (Csikszentmihalyi, 1990). Flow is not only an optimal momentary experience but 
also an integral contributor to optimal development (Csikszentmihalyi & Larson, 1984; 
Csikszentmihalyi et al., 1993; Hektner, 1996). For many people, flow occurs when their 
skills are matched by a commensurate level of challenges and opportunities. This equilib- 
rium is dynamic; given a static level of challenge, eventually the person’s skills increase, 
and boredom, relaxation, or apathy take hold. In order to maintain the balance, the per- 
son must continually find new challenging opportunities, which in turn require higher 
levels of skill. The enjoyment of the flow experience provides the intrinsic motivation that
drives development up the spiral of increasing complexity in challenges and skills.

Empirical evidence of this link between flow experiences and positive developmental 
processes was found by Hektner (1996), who compared adolescents who increased in 
flow experiences over 2 years with those who decreased. Increases in flow were associ- 
ated with increases in time spent doing and thinking about productive activities, such as 
schoolwork, and decreases in time spent doing passive leisure activities, such as watching 
television. Moreover, those with more flow experiences had higher levels of motivation, 
concentration, self-esteem, and future goal orientation during their productive activities 
than did those with little flow. Adolescents who increased in flow experiences were better 
able to see connections between their current activities and their planned future careers, 
a difference from those who decreased that was already present at the initial time point,
when both groups had equal amounts of flow. This link between current activities that 
produce flow and future goals was also found to be important in a separate sample of 
talented adolescents in the arts or science and math. Those who remained committed to 
their talent area 2 years later were the ones who experienced their activities using the talent
as both involving and important to their future (Csikszentmihalyi et al., 1993).

Productive activities that are accompanied by a combination of intrinsic motivation, 
goal directedness, and concentration have been labeled growth-conducive experiences 
(Hektner, 2001). These experiences were found to occur more often when adolescents 
reported high levels of autonomy, as well as support and challenge from both their family 
and school. Challenge from school or family was related to goal directedness and con- 
centration during productive activities, and support was related to intrinsic motivation. 
But only the combination of strong challenge and support from the school and family 
contexts predicted the combination of motivation, concentration, and goal directedness. 


The Family Context 


Research on the role of the family in the daily life of adolescents has centered around two 
major questions: 


1. What is their emotional experience while with family, and how does that change 
with development? 

2. How do differences in support and challenge provided by the family relate to dif- 
ferences in daily life experiences relevant to developmental processes? 


Larson and colleagues (1996; Larson & Richards, 1991) documented a decline through- 
out the middle school years in the positive affect of adolescents while they were with their 
families that accompanied the drop in time they spent with family. By later adolescence,
there was a complete rebound, which started to occur earlier for boys than for girls. The 
overall pattern of declining time with family, coupled with stable, one-on-one time with 
each parent and recovered emotional neutrality, led Larson and colleagues to conclude 
that adolescence is a time of both disengagement from the family and transformation in 
the relationship with continued connection. 

Other studies that have compared the impact of different families on the daily lives 
of adolescents have focused on the degree of support and challenge families offer. Rat- 
hunde (1996), the primary researcher in this area, defines support as offering comfort, 
consistency, responsiveness, and harmony. Challenge entails stimulation, discipline,
training, and expectations for the adolescent to develop autonomy and self-direction. 
Families offering high levels of both support and challenge were found to have ado- 
lescents with the highest levels of happiness at home, especially during home routines 
and homework, and these adolescents spent more time with their parents than did teens 
from other families (Csikszentmihalyi et al., 1993; Rathunde, Carroll, & Huang, 2000; 
Rathunde & Csikszentmihalyi, 1991). In addition, adolescents from high support/high 
challenge families had higher self-esteem and saw their activities as having greater impor- 
tance both in the moment and relative to their future goals (Rathunde et al., 2000). The 
different roles played by support and challenge are illuminated by findings from different 
samples, showing that adolescents from families offering only high levels of support had 
more positive moods and higher levels of spontaneous interest in schoolwork but saw 
their work as less important compared to other adolescents (Rathunde, 1996, 2001). 
On the other hand, adolescents from families with only a high level of challenge showed 
stronger goal-directed interest (characterized by a sense of importance) in their activities, 
but had lower moods, which may indicate a sense of drudgery. Only in families providing 
high levels of both support and challenge did adolescents experience the highest levels of 
undivided interest, combining positive moods, engagement, and openness with a strong 
sense of importance to goals (Rathunde, 1996, 2001). 


Experiences with Peers 


As noted earlier, adolescents gradually replace some of their family time with more time 
spent with their peers. Girls have been found to spend more time thinking about their 
peers and talking with their friends than boys do, and in conversations with friends, 
boys discuss other peers less and sports more than girls do (Raffaelli & Duckett, 1989; 
Richards, Crowe, Larson, & Swarr, 1998). The gender difference in amount of time spent 
talking with friends and in the topics discussed widens with age, suggesting that even as 
adolescent girls and boys begin to spend more time with each other, they are solidifying 
their gendered patterns of social interaction. 


Experiences in School 


School is a major context of adolescent development, not just cognitively, but psychoso- 
cially as well. Indeed, in the United States about one-third of school time is unstructured 
time spent not doing any academic activities (Shernoff, Knauth, & Makris, 2000). Dur- 
ing class, adolescents have rated their experience as below average on most affective and 
motivational dimensions (Csikszentmihalyi & Larson, 1984). An exception is that their 
concentration is typically high, although they report that it is difficult to concentrate, 
particularly in core academic classes. Students tend to experience greater intrinsic moti- 
vation and positive affect in nonacademic classes such as vocational education, computer 
science, and art than in core academic classes, of which history appears to elicit the low- 
est levels (Shernoff et al., 2000). One of the reasons for these differences across academic 
subjects lies in the way that they are taught. Students’ affect and intrinsic motivation are 
found to be lowest while listening to their teacher lecture, an activity that is done more 
often in history and other academic classes than in nonacademic classes. During group 
work and discussions, they usually have relatively higher levels of intrinsic motivation 
and happiness, but these classroom activities occur less than 10% of the time (Csikszentmihalyi
& Larson, 1984; Shernoff et al., 2000).

There are some cultural, social class, and gender differences in the way adolescents 
experience school. Asakawa and Csikszentmihalyi (1998) reported that Asian American 
adolescents experienced greater enjoyment, self-esteem, control, and sense of importance
to goals while studying than did European Americans. Autonomy support from their par- 
ents plus perceived competence while studying predicted grades for Asian Americans but 
not for European Americans (Asakawa, 2001). Girls in a diverse sample were found to 
experience higher levels of flow in school activities than boys (Shernoff et al., 2000). Girls 
also perceived classroom activities to be more important to their future goals than did 
boys. Boys had greater motivation and felt more freedom of choice than girls during physical
education classes (Kirshnit et al., 1989). Finally, students from low-income communities,
relative to those from more affluent communities, rated their nonacademic classes as
more important to their future goals (Shernoff & Csikszentmihalyi, 2001).


Experiences in Nonschool Activities 


Recently, there has been greater recognition of the significance of nonschool programs 
and institutions for adolescent development. Shernoff and Vandell (2007), in a study of 
eighth-grade participants in after-school programs, found that the students experienced 
high levels of intrinsic motivation and concentrated effort in sports and arts activities, but 
low levels of motivation and affect while doing homework. Activities involving both peers 
and adults produced the highest levels of intrinsic motivation and engagement. Findings 
from another study of the discretionary-time activities of urban African American fifth 
through eighth graders concurred that more active and structured activities are expe- 
rienced with the highest levels of positive affect (Bohnert, Richards, Kohl, & Randall, 
2009). More active but unstructured activity was associated with higher delinquency, but 
only for those living in dangerous neighborhoods. On the other hand, more passive and 
unstructured activities were related to more depressive symptoms, but only for those in 
safe neighborhoods. For troubled adolescents living in a residential treatment center, time 
in therapeutic activities and with clinical staff was experienced quite positively, whereas 
more unstructured moments with residential staff were among their most negative expe- 
riences (Hogen & Hektner, 2008). 

Adolescents with jobs typically enjoy work more than school (though it is still below 
their average enjoyment), but find school more important to their future goals than work 
(Schmidt, Rich, & Makris, 2000; Schneider & Stevenson, 1999). Work provides above- 
average levels of flow, engagement, self-esteem, and concentration, but neutral to slightly 
negative levels of happiness (Hektner, Asakawa, Knauth, & Henshaw, 2000; Schmidt et 
al., 2000). 


Studies of Emerging and Young Adults 


Although much of personality and social psychology is based on studies of college sopho- 
mores, there are surprisingly few studies of the daily life of college students and other 
emerging adults. In a study of college and graduate students, Homan and Hektner (2007) 
found that the students felt most creative during active leisure activities and in productive 
activities done on campus but outside of class. Students’ lowest levels of feeling creative
came during class time, during passive leisure, and while alone. Creativity on campus was 
related to greater autonomy support from instructors and to greater feelings of belonging- 
ness and social support. Those students who experienced more flow during their produc- 
tive activities produced drawings judged to be more creative than drawings produced by 
their peers. Finally, older students derived a greater sense of creativity from their produc- 
tive activities than did younger students. 

Most other studies of college students have focused on negative experiences. For 
example, feelings of depression and anxiety have been found to correlate in college stu- 
dents with variability (standard deviation) in self-esteem (Oosterwegel, Field, Hart, & 
Anderson, 2001), persistent and pervasive negative causal attributions after negative 
events (Swendsen, 1998), and concealable stigmas (e.g., being bisexual, bulimic, or poor; 
Frable, Platt, & Hoey, 1998). Those with concealable stigmas felt worse than those with 
conspicuous stigmas (e.g., overweight, stuttering) and were more often alone. Being with 
similar others lessened their anxiety and depression and improved their self-esteem, but 
these encounters were rare because of the difficulty in identifying similar peers. This 
implies that organized support groups may be more necessary and beneficial for those 
with concealable stigmas than for those with other problems. 
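
The variability measure mentioned above, the standard deviation of a person's own momentary self-esteem ratings, can be computed from ESM data roughly as follows. This is a minimal sketch that assumes the reports arrive as (person_id, rating) pairs; the function name and data layout are illustrative and are not drawn from the studies cited.

```python
from collections import defaultdict
from statistics import mean, stdev


def within_person_variability(reports):
    """Map each person to (mean, SD) of their momentary ratings.

    `reports` is an iterable of (person_id, rating) pairs.
    """
    by_person = defaultdict(list)
    for person_id, rating in reports:
        by_person[person_id].append(rating)
    return {
        pid: (mean(vals), stdev(vals))
        for pid, vals in by_person.items()
        if len(vals) >= 2  # the sample SD needs at least two reports
    }
```

Two people can share the same average rating yet differ sharply in variability, which is why the within-person standard deviation is treated as a separate predictor from the mean level.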

In a study that replicated some of the findings regarding adolescents, Fleeson and 
Cantor (1995) showed that women in a college sorority rated social situations more pleas- 
ant than academic situations or being alone, and rated weekends more pleasant than 
weekdays. However, the relevance of their current activity to the goal of getting along 
with others also contributed to its pleasantness, regardless of context. A group of these 
women were so achievement oriented that they saw academics as a relevant task, even 
during socializing (Harlow & Cantor, 1994). They received encouragement and reas- 
surance from their peers but sacrificed more enjoyable aspects of social interaction. As a 
result, their friendships suffered, and they had smaller social networks. 

Outside of the college setting, ESM has been used to study the daily life of young 
couples and new parents. Kirchler (1988) found that the men in couple relationships felt 
happier, stronger, and freer when with their partners than at other times, but the same 
affective benefits did not accrue to the women when with their men. Happier couples felt 
stronger and freer overall, even when not together, compared to less happy couples, and 
members of happy couples were better able to assess their partner’s needs in moments 
when they were together. Delle Fave and Massimini (2000, 2004) studied Italian cou- 
ples who had just become parents. These couples experienced caring for their infant not 
only as more challenging than work but also as bringing more positive moods than lei- 
sure. Fathers reported higher moods and more intrinsic rewards during child care activi- 
ties than mothers, whereas mothers indicated lower levels of confidence and skills than 
fathers. The combination of enjoyment and challenge found in caring for a new infant 
led Delle Fave and Massimini (2004) to conclude that the transition to parenthood is an
opportunity for adult development. 


Studies of Adults 


Many studies of the daily life of adults in midlife continue this focus on the parental and 
spousal role, often examining how these roles interact with the role of worker. Other 
major developmental themes include the changing experience of emotions, and the stability
and variability of personality traits across the adult lifespan.


The Experience of Work and Family 


Individuals in couple relationships affect each other’s moods and behaviors. Several stud- 
ies have noted a correlation between the emotions of the two members of a couple, at 
least when they are together, and emotional transmission between spouses or partners 
has also been documented (Larson & Richards, 1994a, 1994b; Thompson & Bolger, 
1999). Larson and Richards (1994a) concluded that transmission is stronger from hus- 
bands to wives than vice versa. In a study that also tracked the women’s menstrual cycle 
over 30 days, LeFevre, Hendricks, Church, and McClintock (1992) found that women 
were happiest in their periovulatory and early menstrual phases. Their male partners 
experienced higher levels of activation (feeling strong, active, alert, outgoing) when the 
women were premenstrual. These biological effects on psychological states were smaller 
than other environmental effects. 

Other studies combining the assessment of a biological marker with an examination 
of daily life have focused on cortisol, a hormone associated with the body’s response to 
stress. Adam (2005) found adults (especially men) to have higher levels of cortisol during 
moments of negative mood and lower levels when they were engaged in positive social or 
productive activities. Although Adam’s results show lower levels of cortisol at work than 
would be expected by time of day, Klumb, Hoppmann, and Staats (2006) found daily 
cortisol levels to be positively related to time spent in paid labor. In their study of dual- 
earner German couples, cortisol levels also increased with time the individual spent in 
household work and with time that the individual’s spouse spent in paid labor. Time that 
the individual’s spouse spent doing household work was related to lower levels of cortisol 
for the individual. The precise mechanism underlying this effect of one person’s activities 
on another’s biological response needs further study to be elucidated. 

That doing housework is related to increases in a stress hormone is consistent with 
the results of several studies examining gender roles in the family. Mothers have been 
found to feel more distress and anxiety than fathers, particularly regarding home and 
family demands, and child arguments (Almeida & Kessler, 1998; Zuzanek & Mannell, 
1993). They do more housework and child care than fathers, even when both spouses 
work full time (Larson & Richards, 1994a; Zuzanek & Mannell, 1993). Men experience 
a greater sense of discretion when choosing to do housework and child care, which may 
be why they feel good while doing these things. Women report more positive moods dur- 
ing work activities than during home activities (Larson & Richards, 1994a; Williams, 
Suls, Alliger, Learner, & Wan, 1991), and a significant portion of their time is spent jug- 
gling two or more tasks from their work and family roles (Williams et al., 1991). While 
doing this “interrole juggling” they feel less enjoyment and more negative affect than at 
other times. 


The Experience of Emotions 


Although no long-term longitudinal studies of adults using ESM have been reported, sev- 
eral studies have used cross-sectional samples of wide-ranging ages or have incorporated 
ESM into one wave of a longitudinal study. These studies converge on the finding that the 
frequency of positive emotions increases with age, while the experience of negative emotions
decreases (Carstensen, Pasupathi, Mayr, & Nesselroade, 2000; Riediger & Freund,
2008; Riediger, Schmiedek, Wagner, & Lindenberger, 2009). One mechanism for this 
developmental change may lie in changes in motivations and decreases in motivational 
conflicts. Riediger and Freund (2008) found that, with age, people have fewer experiences
in which they feel they want to or should be doing something other than what they
are doing. In another study, of people ranging in age from 14 to 86, older adults had 
more prohedonic motivations (wanting to maintain positive affect and dampen negative 
affect), whereas adolescents reported more contrahedonic motivations (wanting to main- 
tain or enhance negative affect or dampen positive affect; Riediger et al., 2009). 

In addition to the developmental shift in the overall valence of experienced emotion, 
there may also be a change in the differentiation of emotions. Carstensen and colleagues 
(2000) found that 19 emotion items factored into a greater number of separate scales 
for older than for younger people. Furthermore, the negative correlation between posi- 
tive and negative affect declined with age, suggesting that people experience a greater 
mix of emotions simultaneously as they age. Tugade and Feldman Barrett (2002) called 
the reporting of emotional experience in a more discrete, differentiated way “emotional 
granularity,” and found that people with more granularity in their emotions also had 
better emotion regulation and coping skills. 


Personality Development 


Some researchers have focused on the stable personality traits underlying momentary 
emotional experiences. Marco and Suls (1993), for example, found that both momentary 
and daily negative moods were predicted by trait negative affect. People with higher levels 
of the trait reported more distress from current and prior problems, besides having gener- 
ally more negative moods even in nonproblem moments. Variability in mood may also be 
a stable personal characteristic; people show stability in the level of their mood variability 
across time and across situations (Penner, Shiffman, Paty, & Fritzsche, 1994). Fleeson 
and colleagues (Fleeson, 2001; Fleeson & Leicht, 2006; Noftle & Fleeson, 2010; see also 
Fleeson & Noftle, Chapter 29, this volume) note that the typical person’s distribution 
across occasions is about as wide as the distributions across individuals. In other words, 
people differ within themselves about as much as they differ from other people. Neverthe- 
less, there are still high levels of stability in the mean level and variation of various traits 
within a person from one week to the next. These traits may also wax or wane over the 
lifespan. Noftle and Fleeson (2010) found older and middle-aged people to exhibit more 
agreeable and emotionally stable behavior than the young, with less variability in the 
expression of these traits across situations. Middle-aged adults had the highest frequency 
of extraverted and conscientious behaviors. Variability in openness across situations was 
highest in older adults and lowest in young adults. 
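
The comparison between within-person and between-person variability described above can be made concrete with a small numerical sketch. The ratings below are invented for illustration, and the two summary functions are only one simple way to operationalize the idea; they are not the analyses reported by Fleeson and colleagues. The within-person figure is the average of each person's standard deviation across occasions, and the between-person figure is the standard deviation of the person-level means.

```python
from statistics import mean, pstdev

# Hypothetical momentary ratings (e.g., extraverted behavior on a 1-7 scale),
# one list of sampled occasions per person. Invented for illustration only.
ratings = {
    "p1": [2, 4, 6, 3, 5],
    "p2": [3, 5, 7, 4, 6],
    "p3": [1, 3, 5, 2, 4],
}

def within_person_sd(data):
    """Average, across people, of each person's SD over occasions."""
    return mean([pstdev(occasions) for occasions in data.values()])

def between_person_sd(data):
    """SD of the person-level means."""
    return pstdev([mean(occasions) for occasions in data.values()])

within = within_person_sd(ratings)
between = between_person_sd(ratings)
print(f"within-person SD:  {within:.2f}")
print(f"between-person SD: {between:.2f}")
```

In this toy data set each person varies more across occasions than the people differ from one another on average, which is the kind of pattern the density-distribution findings describe.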


Cognitive Development 


The finding that intraindividual variability in a trait is just as great as interindividual 
variability also appears to hold for certain cognitive abilities. In one of the few ESM
studies focused on cognitive functioning in adults, Salthouse and Berish (2005) found 
that variability in reaction time on cognitive tasks was just as great within individuals as 
it was between them. They also replicated the age-related increase in reaction time across 
the adult lifespan. As Hoppmann and Riediger (2009) conclude, training effects and
method reactivity are key challenges that must be addressed in any research attempting
to use experience sampling to study cognitive functioning.


Studies of Older Adults 


As with young children, older adults who have diminished cognitive or physical capacity 
are a challenging population from whom to gather intensive longitudinal data regarding 
their momentary inner experiences. Generational norms may also interfere with data 
collection; most of the elderly widows in Hnatiuk’s (1991) sample did not take their pagers
with them when they left the house, feeling that the devices were too interruptive and intrusive.
On the other hand, some researchers have found success by gathering oral responses by 
phone or in person, several times daily, from nursing home residents (Voelkl & Mathieu, 
1993; Voelkl & Nicholson, 1992). The residents in these studies spent about 30% of their 
day resting or doing nothing (sitting, thinking, waiting). Their most pleasurable moments 
occurred while socializing, eating, and engaging in unstructured activities of their choosing.
Moments with high levels of challenge and skill were also related to positive affect
(Ellis, Voelkl, & Morris, 1994). Depressed residents spent more time watching television 
and did less self-care than nondepressed residents (Voelkl & Mathieu, 1993). 

Other studies concur that being with others increases positive affect for older adults, 
whereas being alone is related to more neutral or negative affect (Klumb, 2004; Lar- 
son, Zuzanek, & Mannell, 1985). Klumb (2004) found the association between affect 
and social context to be weaker for those who spend a lot of time alone. When alone, 
the independent-living retirees in the Larson and colleagues (1985) study not only felt a 
greater sense of control but also experienced a stronger wish to do something else. They 
were most often alone in the morning, but they felt worse about being alone in the after- 
noon and evening. 

Married retirees spent much less time alone than their unmarried peers, who averaged 
two-thirds of their time alone (Larson, Mannell, & Zuzanek, 1986). Their most positive 
affective experiences occurred while with their spouse and friends together. Moments 
with spouse alone or with family were emotionally neutral or below average, despite a 
positive correlation between time with spouse and global reports of life satisfaction. This 
paradox may be explained by examining activities in each context. Time with friends 
tended to be filled with more active leisure pursuits, contributing to positive moods. Time 
with family, however, was typically occupied by more maintenance and passive leisure 
activities. A more recent study confirmed that time spent with a spouse is related to posi- 
tive marital adjustment (Janicki, Kamarck, Shiffman, & Gwaltney, 2006). Those older 
adults who rated their spousal interactions higher on agreeableness and lower on conflict 
had better marital adjustment. The frequency of highly agreeable spousal interactions 
was related to marital adjustment, whereas the frequency of highly conflictual interac- 
tions was not. 


Advantages, Challenges, and Future Directions 


Clearly, studies of daily life have had a major impact on the field of developmental science. 
This research has advanced our understanding of immediate context, as that context is 
perceived by the individual in the moment. In contrast to much of the laboratory-based
research in personality and social psychology, daily life studies have strong ecological 
validity that makes their findings more readily accessible and relevant to practitioners 
and lay readers (see also Reis, Chapter 1, this volume). One of the major strengths of this 
set of methods is that it allows the study of intraindividual variability as a developmental 
variable, both as an outcome and as a factor in subsequent developmental processes. 

The focus on intraindividual variability is fairly recent and will continue to generate 
future research (see also Hamaker, Chapter 3, this volume). The collection of more bio- 
logical markers of physiological and neurological processes will likely also be an increas- 
ingly important component of daily life studies of development (see also Schlotz, Chap- 
ter 11, this volume). Technological advances will also make possible a wider range of 
data collection tools, including global positioning system (GPS) coordinates to pinpoint 
exact location, photographs or video and audio recordings to supplement self-report, and 
texting to incorporate open-ended responses (see also Goodwin, Chapter 14, and Mehl 
& Robbins, Chapter 10, this volume). 

Of course, anytime researchers provide people with high-tech gadgets to carry 
around in the real world, there is a risk of loss or damage to the gadgets. Adolescents, 
due in part to their developmental state, may be particularly prone to use an inadequate 
level of care. In our laboratory we minimized this risk by engraving each device with a 
“Property of... ” stamp and a unique ID number. Participants were told that the number 
would be used to track the device should it become lost. Even in a sample of troubled and 
delinquent teenagers, all devices were returned intact. 

Anytime children or adolescents are signaled during school hours, another concern 
is class disruption. If several students in a class are signaled, it is important that their 
devices be synchronized exactly, so that beeps from different devices are not scattered 
over the course of a minute. Additionally, a meeting with teachers before the study is 
needed to explain what will happen, to allay their fears, and to ask for suggestions and 
cooperation. To begin the study, students should be called to an orientation meeting, 
where they can be instructed how to silence their device and what to do if they need to be 
away from it or cannot answer (e.g., while swimming). 

Future developmental studies of daily life should attempt to deal with the limitations 
and challenges encountered to date. One of the most obvious challenges of these meth- 
ods is the cognitive and physical capabilities they require of research participants, thus 
limiting their use with the very young and the very old. However, as mentioned earlier, 
researchers have found ways to collect intensive longitudinal data from people represent- 
ing a large range of the lifespan. Infancy researchers have devised ingenious ways of 
ascertaining the focus of attention of their subjects by measuring habituation and dura- 
tion of gaze; perhaps some of these methods could be adapted into studies of the daily life 
of infants and toddlers. For example, caregivers could be signaled at random moments to 
turn on a video camera for 10-minute segments of recording an infant while the caregiver 
simultaneously responds to questions on context, activities, and behaviors. 

The other selectivity problem is a bit more subtle but perhaps more detrimental to our 
developmental conclusions. That problem is that even among the populations that are able 
to participate in daily life studies, the motivational demands of the intensive data collection
effort are such that our samples undoubtedly overrepresent more agreeable,
conscientious individuals. A challenge for the next generation of studies will be to recruit 
and retain much more diverse and representative samples. To pursue this goal, researchers
could adapt strategies from the community-based participatory research model used
in the field of public health. In this approach, community members are involved from the 
outset in research design, planning, and implementation. 

In addition to broadening the base of participants, the field will also need to broaden 
the range of developmental domains studied by these methods. As shown here, hardly 
anything has been done to describe, explain, or understand cognitive development from 
the perspective of within-person variation across contexts and situations in daily life. 
Researchers will need to move beyond studies of reaction time to examine problem solv- 
ing, learning, and memory in different social contexts, at different times of day, and in 
relation to the individual’s current physiological and emotional state. 

Another way in which these methods could be applied on a broader scale is their 
incorporation in more experimental work with interventions aimed to promote positive 
development or prevent negative developmental outcomes. To date, nearly all daily life 
studies in developmental psychology have been descriptive and/or correlational. How- 
ever, these methods could be especially useful in measuring the mediators and modera- 
tors of intervention effectiveness, those variables that elucidate how, when, with whom, 
and under which conditions an intervention works. Basic developmental research would 
benefit as well from the inclusion of intensive longitudinal data in controlled studies. 

Finally, to understand truly developmental change across the lifespan, researchers 
will need to conduct macro-longitudinal studies that include micro-longitudinal data col- 
lection at each wave. Nearly all of the studies reviewed here are based on cross-sectional 
data, making developmental conclusions drawn from them vulnerable to error if cohort 
differences, rather than developmental changes, are reflected in the data. In the future, 
the strongest developmental studies will be those that combine the classic strategy of 
following people over a span of years with methods that also capture their daily lives as 
they are lived. 


CHAPTER 34


Industrial/Organizational 
Psychology 


DANIEL J. BEAL 


Industrial and organizational (I/O) psychology often is thought of as an inherently
applied field. Our top journal is, after all, the Journal of Applied Psychology, and over
50% of people obtaining a PhD in the field enter into a nonacademic career (Society for 
Industrial and Organizational Psychology, 2006). Obviously, a field whose purpose is to 
study psychological issues within the workplace can and should make contributions to 
practitioners involved in the design of our work environments and the selection of the 
individuals who populate that environment. But psychological issues in the workplace 
reflect a domain broad enough to go beyond the development of practitioner tools. Work 
makes up a substantial portion of our adult lives, and our experiences at work often are 
critical to our overall well-being and sense of self. Given this central role of work in one’s 
life, a psychology of work must also explore work experiences irrespective of their poten- 
tial for practitioners. Indeed, recent calls in the I/O literature have emphasized the study 
of daily experiences as a primary venue through which work and our behavior at work 
can be understood (Beal, Weiss, Barros, & MacDermid, 2005; Weiss & Cropanzano, 
1996; Weiss & Rupp, 2011). 

The primary purpose of this volume is to discuss the methods used to examine daily 
life. In addition to the first three parts of this book, other researchers have explored 
methodological issues in the study of daily life generally and in I/O psychology in par- 
ticular (e.g., Beal & Weiss, 2003; Klumb, Elfering, & Herre, 2009). Rather than further 
explore the details of these methods per se, I provide a review of how they are used 
within the discipline of I/O psychology. Various names have been used to describe these 
methods within I/O psychology, including ecological momentary assessment, experience 
sampling, ambulatory assessment, and daily diary studies. In this chapter, I refer to these 
methods collectively as either daily experience research or simply experiential research. 
Before delving into this research, however, I should say a few words about what might be 
considered the dominant paradigm in I/O psychology (Weiss & Rupp, 2011). 

For a great many years, I/O psychologists have been interested in characteristics of
individuals. For example, as opposed to understanding affective experiences at work, 
I/O psychologists favored traits that reflect differential affective experience (extraver- 
sion, neuroticism, positive and negative affectivity, etc.). Similarly, as opposed to exam- 
ining when, why, and how people are satisfied with their experiences at work, the field 
emphasized summary evaluations of their job as a whole (i.e., job satisfaction as an 
attitude). For almost any particular phenomenon of interest, I/O psychology has tra- 
ditionally favored an examination of variance between individuals (or, more recently, 
between groups) as opposed to how an individual’s experiences and behaviors vary while 
at work. Other psychological disciplines are not as wedded to an individualist perspec- 
tive. For example, developmental psychology emphasizes age-related change, cognitive 
psychology emphasizes stimulus-related changes in mental processes, and social psychol- 
ogy emphasizes situational changes in social behavior. As a well-informed applied disci- 
pline, I/O psychology often borrows from these basic areas of research; however, when 
this research is conjoined with the emphasis on between-person variance, the results 
often are abstractions of the original process of interest (see Hamaker, Chapter 3, this 
volume). 

Despite the decontextualized nature of this perspective, there are good reasons for its 
emphasis. As discussed earlier, an important job of an I/O psychologist is to match indi- 
viduals to jobs in a way that maximizes the benefit for both employer and employee. So it 
is quite useful to know who will perform better at a given set of tasks, as this facilitates 
hiring individuals who will benefit the organization and allows those individuals to feel 
efficacious, productive, and appropriately compensated. After all, organizations typically 
hire individuals. This fact also highlights one of the dilemmas for an experiential, within- 
person perspective in I/O psychology: If organizations and practitioners make decisions 
concerning individuals based on I/O research, how will they be able to use the findings 
of an experiential psychology of work given its emphasis on within-person dynamics? In 
answering this question, a good place to begin is by exploring what the current research 
on daily experiences at work has uncovered. 


An Experiential Perspective of Work 


Consider what happens to you while at work—not the overall summaries of these hap- 
penings, but the occurrences themselves. The task is difficult because it forces you to 
select either prototypes or exemplars of actual experiences. Prototypes, as amalgams of 
actual experiences, lack the precision and detail of a particular experience; exemplars, 
though greater in precision and detail, lack generality to the range of events that have 
occurred. Thus, as the array of described experiences increases, so does the difficulty and 
bias of the resulting description (see Schwarz, Chapter 2, this volume). For this reason, 
daily experience methods are employed to obtain an accurate understanding of what 
actually happens at work. By asking participants to describe a limited array of experi- 
ences, difficulties associated with generality and precision are reduced. Some of these 
studies accomplish this by asking what has occurred over the course of a single day. 
Others achieve an even greater level of specificity by asking about immediate states and 
events. All of them emphasize immediate experience over broad generality. 


The Structure of Work


Thus far, I have emphasized immediate experiences at work over any other form of sum- 
mary evaluation. I have done this to make a point, but clearly there are reasons why some 
summary evaluations can be useful. For one, even though our experiences occur along 
a linear progression of time, we do not necessarily perceive this timeline so linearly. As 
events occur, some seemingly fly by, while others stagnate. Sometimes many seconds 
may pass with very little being encoded into our memories, while at other times, every 
moment is retained for future retrieval. Occasionally, we stop to reflect on a sequence of 
experiences to take stock of where we are and where we are going. These reflections are 
naturally occurring summary evaluations, and an accurate portrayal of our days at work 
should take these evaluations into consideration as well. 

Perhaps as a result of the between-person perspective in I/O psychology discussed 
earlier, little thought has been given to how workers actually experience and reflect on 
their days. Is it one long slog, broken up only by fleeting moments of socializing? Is it a 
continual flow of interactions, tasks, and other events that create a stream of experience 
punctuated only by the physical act of arriving at and leaving work? Or is it a series of 
discrete events, whose structure is predetermined partially independently and partially 
by management? Quite likely, there is some truth to each of these descriptions, and all 
might seem to fit depending on the particular day, job, or person who is asked. Certainly 
it is informative to know on which tasks one worked today or to whom one spoke. Such a 
piecemeal structure is missing several critical elements, however. First, it does not convey 
how these tasks and interactions were ordered. Nor does it suggest the duration of these 
events. Without these two pieces of information, we miss out on the potential to connect 
the influence from some daily events to subsequent events and event reactions. More 
broadly, we miss out on the interconnected stream of experience at work. The morning 
commute carries with it implications for one’s demeanor upon arriving at work. One’s 
demeanor during an interaction early in the day can influence the interpretation of events 
that occur later in the day. Struggling through an awkward midday conference with a 
client can result in a subsequent inability to concentrate on the day’s remaining tasks. All 
of these interesting and highly relevant occurrences are understood only by knowing how 
daily work experiences are structured, both subjectively and objectively. 

To address this issue, my colleagues and I (Beal et al., 2005) have advocated a con- 
sideration of how performance at work is structured episodically. This approach suggests 
that workers shift their focus of attention from one distinct, goal-relevant activity to 
another throughout the day. Presumably, the goals that are typically active at work tend
to be performance related, but not all episodes at work necessarily involve performance
(e.g., breaks, small talk with coworkers). Although there is certainly potential
for interruptions during a performance episode, successfully shifting from one episode 
to another requires a conscious effort to stop thinking about one task and begin think- 
ing about another. To the extent that cognitive resources are then devoted to the focal 
performance episode, an individual will be maximally effective. 

Irrespective of the elements that define the structure of performance episodes, one 
of the more important reasons for construing work in this manner is that it allows for a 
chronological integration of other events; that is, there are a great many events occurring 
at work other than one’s own performance episodes. These events may be abrupt and 
discrete, as in a scheduled break; or they may be diffuse and enduring, as in the office
chatter that serves as either a wanted or unwanted distraction. Having an idea of which 
performance episodes are occurring throughout the workday allows us to examine with 
greater clarity the influence of a variety of events on a focal performance episode. More- 
over, it allows us to trace the continuous stream of experience at work. 

Of course, it is rarely, if at all, possible to trace a complete stream of work experi- 
ences throughout the day. Methods and technologies described in other parts of this 
handbook are primarily responsible for the close approximations of everyday experi- 
ence that researchers gather. Although I/O psychology has embraced experience sampling 
methods generally, their implementation often is difficult to achieve in a true workplace 
context. For example, a traditional random signal-contingent survey, requiring only a 
few minutes to complete, may often represent a significant disruption of a particular 
performance episode. As suggested earlier, such distractions can impede optimal perfor- 
mance during an episode. As a result, organizations often do not wish to participate in a 
study that necessarily disrupts the productivity of their employees, no matter how small 
that interruption may be. Consequently, a great many of the experience sampling studies
conducted in organizations emphasize surveys that occur at key points throughout the 
day. For example, if the episodic structure of work is known, surveys can be placed in 
between the episodes (e.g., Beal, Trougakos, Weiss, & Green, 2006). If interested in the 
cumulative or summary effects of the day’s activities, researchers often collect a greater 
amount of data at the end of the workday (e.g., Ilies, Wilson, & Wagner, 2009). Many 
interesting questions can be answered using research designs that are less burdensome 
on employees; still, as an active researcher in this area, I find myself most excited by the 
opportunities to obtain detailed data throughout the workday, as these data provide a 
clearer picture of the complex network of influences as they unfold. 


How Do We Feel at Work? 


As is the case with many other areas of psychology, the topic of affect in I/O psychology 
is accorded a special place in the study of daily experiences at work. Some of the earliest 
work in I/O involving experience sampling concerned affect (e.g., Csikszentmihalyi & 
LeFevre, 1989; Williams, Suls, Alliger, Learner, & Wan, 1991), including perhaps the 
earliest attempt to track meticulously the experiences of workers. Prior to his 1932 book, 
Rexford Hersey spent an entire year assessing a small but diverse set of workers up to 
four times per day. His primary interest was in linking the valence of their emotional state
to effectiveness (he concluded that there was a definite positive relation), but he also obtained
data concerning their dominant thoughts, stress and fatigue levels, and even physiological
measurements, such as blood pressure. As discussed elsewhere (Weiss & Cropanzano, 1996),
this classic study unfortunately and inexplicably marked the beginning of a long drought 
with respect to research on workplace affect, as well as daily life at work. 


Affective Experience 


More recently, however, there has been a resurgence in the study of affect at work, and 
quite often this research is conducted by using some form of daily experience method. 
Part of this resurgence is due to an influential paper by Weiss and Cropanzano (1996)
detailing what they referred to as affective events theory (AET). Though really more of a 
framework than a theory (see Weiss & Beal, 2005), the authors of this paper highlighted 
the importance of examining critical events at work and our reactions to those events as 
opposed to an exclusive focus on more generalized summary judgments (i.e., attitudes). 
Job attitudes may be important, but our affective experiences at work form the bedrock 
of those attitudes and often can better explain immediate and important behaviors at 
work (e.g., reactions to being treated unfairly). Unsurprisingly, much of the influence of 
this paper was in moving researchers away from treating job satisfaction simply as job- 
related affect, and more toward an acknowledgment and assessment of actual affective 
experiences as they occur at work. 

As a result of this trend, the last decade has seen numerous studies examining daily 
life in organizations and tracking what generates our affective experiences at work, as 
well as what these experiences translate into in terms of work-related behavior. Con- 
cerning the connection between job affect and job attitudes (especially job satisfaction), 
Weiss, Nicholas, and Daus (1999) verified that aggregations of immediate affective reac- 
tions at work hold a strong and independent relation to overall assessments of job satis- 
faction. Ilies and Judge (2002; Judge & Ilies, 2004) have noted that the overall attitude of 
job satisfaction itself varies over time, and they have used experience sampling methods 
to demonstrate that it is both predicted by and predictive of one’s affective state. 

Daily experience studies have also uncovered a number of interesting common pat- 
terns of affective experience at work. For example, Totterdell (2000) and his colleagues 
(Totterdell, Wall, Holman, Diamond, & Epitropaki, 2004) have convincingly demon- 
strated that moods are rather easily transmitted to other members of work groups, even 
after controlling for other shared affective experiences. Although these processes are 
strongest when workers share other characteristics (e.g., a working relationship, a simi- 
lar level in organizational hierarchy), these collective moods can nevertheless influence 
a variety of work-relevant issues, such as collective efficacy of the group or reactions to 
group mergers. 

Temporal patterns of affective experience at work also have been examined. Here, 
the time frame of the study design is critical for identifying particular patterns. For exam- 
ple, studies using closely spaced surveys (e.g., multiple times within a single workday) 
have noted affect cycles that repeat within a week, or even within a day (Trougakos, Beal, 
Green, & Weiss, 2008; Weiss et al., 1999). Beal and Ghandour (2011), who examined 
affect assessed at the end of the workday over several weeks, noted a clear cyclical pattern 
that repeated over the regular 7-day calendar week, with the low point of positive affect 
(PA) and the high point of negative affect (NA) occurring at midweek. Indeed, to the 
extent that workers structure their schedules to match the traditional “weekdays at work, 
weekends off work,” it seems likely that such a pattern often will be observed. 

Another line of research on daily affective experience at work has examined the influence
of both general and specific events that often generate strong affective reactions.
Miner, Glomb, and Hulin (2005) found that positive events at work are reported far more 
often than negative events at work, but have far less impact on one’s affective state. Beal 
and Ghandour (2011) examined specific events and the durability of their influence on 
PA and NA. These researchers found evidence that a relatively mundane positive event 
(i.e., working on an intrinsically motivating task) can influence not only concurrent levels 
of PA but also PA experienced the next day. Finally, to foreshadow a later section, one 
of the more important classes of affective events to receive recent attention in research 
on daily work life is fairness- or justice-related events (Spencer & Rupp, 2009). Indeed, 
considering that the workplace often involves highly emotional events connected to how 
we are treated, it is surprising that researchers in this area have only recently adopted an 
experiential approach. 


Affect Expression 


Another area of research on workplace affect that has utilized the assessment of daily 
experiences is the regulation of emotional expressions at work, often referred to as emo- 
tional labor. For example, Beal and colleagues (2006) assessed the extent to which cheer- 
leading instructors were able to appear appropriately enthusiastic (i.e., as rated by super- 
visors) in the face of negative emotions. These authors found that having higher trait 
levels of emotion regulation eliminated any dampening influence of negative emotions. 
Surprisingly, in terms of appearing enthusiastic to their supervisors, the particular form 
of emotion regulation (suppression, reappraisal, etc.) did not matter; as long as instruc- 
tors engaged in some form of regulation, negative emotions had little influence on ratings 
of enthusiasm. Similarly, Miner and Glomb (2010) found that call center employees who 
scored high in trait levels of mood repair exhibited a weaker negative relation between 
immediate mood and work withdrawal (e.g., taking a break, talking with friends). 

Other researchers have focused on more immediate instances of emotion regulation. 
Bono, Foldes, Vinson, and Muros (2007) found that the effortful regulation of emotions 
at work generally resulted in subsequent stress (for up to 2 hours), but did not result in 
reduced job satisfaction when employees’ supervisors exhibited the qualities of a good 
leader (i.e., high in transformational leadership). Judge, Woolf, and Hurst (2009) noted 
a similar effect of effortful emotion regulation on negative affect and emotional exhaus- 
tion, and found evidence that this relation held primarily for those who were introverted. 
Finally, a group of researchers examining Dutch police officers (van Gelderen, Heuven, 
van Veldhoven, Zeelenberg, & Croon, 2007) found that the regulation of emotion dur- 
ing the day mediated the link between stress at the beginning of the day and stress at the 
end of the day. 

In summary, research on daily work life and affective experience covers a wide array 
of topics and continues to be one of the richer areas of investigation. Although a good 
number of pervasive workplace affective events have been identified and their conse- 
quences documented, there is undoubtedly a great deal more to be discovered. Given 
the inherently emotional nature of most workplaces, this is perhaps unsurprising. A 
dominant theme in this area of research, however, is that despite the emotional nature 
of the workplace context, not all individuals find it so emotional. Many of the stud- 
ies already mentioned have included dispositional moderators that attenuated either the 
affect-eliciting nature of an event or the influence an affective reaction had on some 
work-relevant behavior. As an example, Beal and Ghandour (2011) found that a major 
life event (i.e., the landfall of a destructive hurricane) did indeed reduce levels of daily PA 
at work; however, individuals who had higher levels of trait affect variability exhibited 
these reduced levels of PA through the end of the study, some 6 weeks after the hurricane 
passed through. Those lower in trait affect variability had returned to prehurricane levels 
of daily PA within a week of returning to work. 


What Are We Thinking About at Work? 


Although innumerable topics of thought likely occur while at work, a useful distinction 
is between on-task and off-task attention and thought. Certainly, this is not the only manner in 
which work-related thought can be distributed, but the distinction places the emphasis on 
task performance—one of the more important topics in I/O psychology (e.g., Motowidlo, 
2003). 


On-Task Thought 


When task performance is construed as an episodic process of attentional focus that 
fluctuates greatly on a within-person basis, it becomes clear that any thoughts outside of 
the task, including other work-related thoughts, will detract from maximal performance 
during that performance episode. As such, it is vital to understand the proximal determi- 
nants and consequences of on-task thought. Experiential research related to this issue has 
appeared in at least three areas: work engagement, flow, and goal setting. 


ENGAGEMENT 


One of the few areas of experiential research in I/O psychology to examine on-task 
thought concerns the broader concept of work engagement (Macey & Schneider, 2008). 
Although many conceptualizations of work engagement extend well beyond the notion 
of immediate attention to one’s work, several studies have emphasized this aspect (e.g., 
Rothbard, 2001; Xanthopoulou, Bakker, Heuven, Demerouti, & Schaufeli, 2008). In 
particular, work absorption describes a factor of engagement that deals specifically with 
focusing attention on work tasks or even losing one’s tendency to be distracted by things 
other than the focal task. Perhaps due to the immediate and experiential nature of this 
construct, several studies have examined it using experience sampling methods (e.g., 
Mojza & Sonnentag, 2010; Sonnentag, 2003; Xanthopoulou et al., 2008). These studies 
have demonstrated that higher levels of work engagement can serve as buffers to the work 
constraints that often frustrate progress on work activities. Furthermore, work engage- 
ment often is preceded by supportive events, such as leisure time activities and coworker 
emotional support, and is often transferred from worker to worker in a manner similar to 
affective state (Bakker & Xanthopoulou, 2009). Although state-like treatments of work 
engagement have only recently appeared in the literature, the results are encouraging, 
particularly for identifying what facilitates and what interferes with becoming absorbed 
in one’s work. 


FLOW 


A similar line of experiential research and theory has investigated the notion of flow at 
work. Flow, as conceptualized by Csikszentmihalyi (e.g., Csikszentmihalyi & LeFevre, 
1989), is a state of optimal experience involving heightened focus of attention, aware- 
ness, and performance, and occurs when an activity is challenging but one’s skills are up 
to the challenge. Rather than detailing the task-related content of thought during these 
episodes, flow theory is concerned instead with states that result from being absorbed 
in one’s activities. According to the theory, this state of optimal experience is a primary 
determinant of happiness and life satisfaction. Although the theory of flow was 
introduced more than 30 years ago, empirical research conducted in the workplace is 
seemingly rare. The studies that do exist have demonstrated, however, that conditions favoring flow (i.e., 
high demands and high skill levels) do produce pleasant experiences and heightened work 
motivation (Csikszentmihalyi & LeFevre, 1989; Fullagar & Kelloway, 2009). Indeed, 
there is some indication that introducing certain forms of work-related challenges (e.g., 
time pressure) can result in favorable work-related outcomes such as creativity and proac- 
tive behavior (Ohly & Fritz, 2010). Another consistent finding in this area is that flow 
states at work are tied to higher levels of performance primarily for those individuals who 
are conscientious (Demerouti, 2006; Snir & Zohar, 2008). 


GOAL SETTING 


Another factor influencing on-task thoughts involves the important activity of setting 
goals at work. Traditional research on goal setting has been enormously influential in 
I/O psychology (Locke & Latham, 1990), but only recently has it taken an experiential 
approach. The area seems promising, however, due to the finding that goals set at one 
point in time can influence the extent to which workers devote attention to a subsequent 
relevant task (e.g., Terborg, 1976). Put differently, goal setting is an activity that can 
encourage on-task thought. Recently, a small body of experiential research has exam- 
ined some basic issues in this area. Like the research on flow, scholars primarily have 
been interested in the consequences of focusing one’s attention on goals. For example, 
Harris, Daniels, and Briner (2003) examined the question of whether goal setting and 
attainment influence affective well-being at work, and found that attainment of goals led 
to increases in both activation (i.e., the arousal component of affective state) and pleas- 
antness levels. Claessens, van Eerde, Rutte, and Roe (2010) explored the characteristics 
of goals that result in the attainment of daily goals. Results suggested that the priority, 
urgency, and importance of goals all influenced the completion of tasks related to those 
goals. Interestingly, though a positive relation might be assumed between importance and 
task completion, these authors found the relation to be negative, suggesting that workers 
may focus on attaining goals that reflect the “low-hanging fruit.” Finally, König, van 
Eerde, and Burch (2010) found that adapting one’s goals throughout the day to address 
changing situations typically resulted in NA and lowered perceptions of productivity. 
Though this may seem to suggest that goal adaptation is counterproductive, this relation 
may exist simply because employees who adapt their goals tend to do so when they face 
greater obstacles (e.g., increased number of unplanned tasks and time needed to complete 
tasks), and their concerns for these obstacles are reflected in subsequent affective state 
and performance perceptions. 


Off-Task Thought 


Moving away from experiential research investigating on-task thought, there also is a 
large amount of work examining thoughts unrelated to the task at hand. Surprisingly, 
distraction itself—a topic that should be vital to understanding task performance—is 
largely confined to laboratory work in cognitive psychology (e.g., Lavie, 2005); however, 
I/O psychologists have examined specific domains of off-task thought. Although there is 
likely a multitude of off-task topics to attract the attention of workers, I/O psychologists 
have placed particular emphasis on two such topics: organizational justice, or fair 
treatment at work, and the work-family balance. 


JUSTICE AND TREATMENT AT WORK 


Research examining daily life at work often touches on the topic of how others— 
particularly those who hold some power over us—treat or mistreat us. As mentioned 
earlier, this area overlaps greatly with research on affect, as events related to fairness are 
inherently affective (see Barsky, Kaplan, & Beal, 2011), yet affect is not always the cen- 
tral element of the research. For example, though Conway and Briner (2002) examined 
key predictors of emotion and daily mood at work, the primary emphasis was on under- 
standing the nature of the psychological contract, defined as the subjectively held beliefs 
employees have concerning the nature of exchange relationships with their employers. 
Conway and Briner’s early work in this area helped to document the importance of bro- 
ken and kept promises in predicting how an employee feels on a day-to-day basis. 

Since then, researchers in this area have extended the consequences of daily unfair 
treatment to other important and relevant behaviors at work, such as mistreating cowork- 
ers, using work time for personal activities, lateness to work, or other counterproductive 
work behaviors (Cohen-Charash & Mueller, 2007; Judge, Scott, & Ilies, 2006). An expe- 
riential perspective on justice research has also helped researchers better understand the 
structure and function of justice itself. In particular, Loi, Yang, and Diefendorff (2009) 
found that justice information received as a part of a social interaction was a strong pre- 
dictor of immediate ratings of job satisfaction, yet the influence of these forms of justice 
was moderated by broader, individual-level perceptions of outcome favorability and the 
fairness of the organization’s procedures (i.e., when employees hold broad perceptions of 
organizational unfairness, they are affected more by fairness-relevant interactions with 
others). 

Finally, with the advent of technologies such as the Electronically Activated Recorder 
(EAR; Mehl & Robbins, Chapter 10, this volume), researchers have begun to examine how 
employees treat each other by analyzing the content of specific interactions. For example, 
Holleran, Whitehead, Schmader, and Mehl (2011) used the EAR to examine coworker 
conversations between men and women in the science, technology, engineering, and math 
(STEM) fields of academia. Results suggested that people engaged female faculty less fre- 
quently in research-oriented discussions, and that research discussions between male and 
female faculty resulted in greater job disengagement for females. 


WORK-LIFE BALANCE 


One of the larger areas of experience sampling research in I/O examines the interface 
between life outside of work and life at work. To the extent that events at home gener- 
ate thoughts while working, this topic can be considered a class of off-task thought. 
Truthfully, however, most of this research investigates the spillover of work into one’s 
life outside of work, particularly as it relates to relations with one’s family (which is per- 
haps responsible for the more common term work-family conflict given to this area of 
research). Nevertheless, the primary point here is that what goes on in one domain can 
influence how one feels, thinks, and behaves in the other domain. Indeed, one of the earli- 
est studies to use experience sampling methods at work (Williams et al., 1991) found that 
working mothers often experienced negative affect and reduced task enjoyment when 
they had recently juggled both work and family tasks. Interestingly, these researchers also 
observed that these effects occurred primarily from one episode to the next within a day 
but tended to exhibit the opposite relations from one day to the next. 

An important methodological aspect of the studies in this area is that most include 
multiple settings (i.e., surveys at work and home) and often multiple sources of data (e.g., 
husband and wife reports). For example, Jones and Fletcher (1996) examined the influ- 
ence of work on affect and stress experienced at home. Results suggested that work stress 
influenced mood at home, and that communication of one’s mood resulted in a contagion 
effect on spousal mood. Expanding upon these initial findings, several other studies have 
verified the spillover of work events and stressors not only to the worker’s feelings and 
behaviors at home (e.g., Heller & Watson, 2005; Judge & Ilies, 2004), but also to the 
feelings and behavior of spouses (Ilies et al., 2007; Song, Foo, & Uy, 2008). 


What Are We Doing at Work? 


So far, I have discussed thoughts and feelings examined from an experiential perspective 
in I/O psychology. Filling out the triumvirate of experiences at work is actual behavior. 
Traditionally, behavior is a particularly important domain, as I/O psychologists generally 
acknowledge that performance at work is best conceptualized in terms of behaviors as 
opposed to the results or outcomes of behaviors (Motowidlo, 2003). A conceptualization 
of performance as observable behavior would seem to favor the methods and perspective 
of experience sampling, yet surprisingly few studies have attempted such an approach. 
Those that have fall roughly into two categories: examinations of either task or contex- 
tual performance (Motowidlo, Borman, & Schmit, 1997). Task performance consists of 
behaviors that support the technical core of the organization, such as transforming raw 
materials into products or services, or supplying an organization with its needed raw 
materials. Contextual performance consists of behaviors that do not support the techni- 
cal core per se, but do support the broader social and psychological environment of the 
organization. 


Task Performance 


Experiential studies examining task-related performance behaviors have faced an enor- 
mous difficulty: How does one obtain relatively objective ratings of performance repeat- 
edly over brief intervals of time? These studies have taken one of three approaches: self- 
reports of performance, observer reports of performance, and objective measures of 
performance. 


SELF-REPORTED PERFORMANCE 


Typically, researchers have shied away from using self-ratings of performance due to 
questionable validity and leniency issues (Harris & Schaubroeck, 1988). These results, 
however, have always concerned self-ratings made at a between-persons level. When the 
comparison is relative to one’s own level of performance (i.e., a within-person approach), 
it seems likely that these issues would be of less concern (Beal & Weiss, 2003). The 
earliest uses of daily experience methods to examine performance, however, were motivated 
less by an interest in the advantages of within-person measurement and more by the 
contextual richness of capturing work processes in the field. For example, Sellen (1994) 
was interested in the nature of everyday errors that occur at work. She had workers keep 
records of their daily slipups and mistakes, and used the diary data to create a basic tax- 
onomy of these sorts of errors. 

More recently, researchers have used self-reports of performance throughout the day 
to examine connections to other dynamic experiences and states. For example, Foo, Uy, 
and Baron (2009) found a link between affective state and self-reported effort. Specifi- 
cally, NA predicted required effort, but PA predicted effort beyond what is necessary. 
Dalal, Lam, Weiss, Welch, and Hulin (2009) examined the extent to which employees 
see changes in their own overall performance as a function of changes in task, as well as 
contextual performance. Results supported the view that overall job performance ratings 
are a function of both task and contextual performance, but only when contextual per- 
formance is directed toward the organization as a whole (as opposed to specific individu- 
als within the organization). 


OBSERVER REPORTS OF PERFORMANCE 


In studies of performance at a between-persons level of analysis, supervisory ratings are 
quite common. In within-person experiential studies, however, they are rare. The reason 
for their rarity is apparent when the logistics of the study design are considered. Not only 
are surveys collected from employee participants repeatedly over a relatively short time 
frame, but supervisors also need to supply performance evaluations that match the sched- 
ule of participants’ surveys. Beyond the analytical challenges associated with matching 
the temporal dynamics of both sources of data, there is also the fact that supervisors 
often represent a potential third level of analysis (if multiple employees are nested within 
a single supervisor). 

One such study, described earlier, examined the extent to which service providers 
behaved in accordance with affective display rules (i.e., acted spirited and enthusiastic). 
This specific form of performance behavior was examined in cheerleading instructors 
during a training camp (Beal et al., 2006). The logistical challenges described earlier were 
alleviated to some extent, as the instructional sessions were all coordinated; instructors 
completed surveys at the end of each session, and supervisors provided ratings for each 
instructor at some point during each session. As each supervisor rated several instructors, 
researchers overcame analytical challenges to some extent by first controlling for a fixed 
effect of supervisor (i.e., using a series of dummy-codes) prior to examining the fixed 
effects of interest in the study. 
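
The dummy-coding step mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dummy_code` and the supervisor labels are hypothetical, and the sketch is not the analysis code used by Beal and colleagues (2006). It turns a supervisor identifier into a set of 0/1 indicator columns, with one supervisor left out as the reference category, so that supervisor-level differences can be entered as fixed effects before the predictors of interest.

```python
def dummy_code(labels):
    """Build 0/1 indicator columns for a categorical variable.

    The first level (in sorted order) is the reference category and gets no
    column, so k levels yield k - 1 indicator columns.
    """
    levels = sorted(set(labels))
    non_reference = levels[1:]
    rows = [[1 if label == level else 0 for level in non_reference]
            for label in labels]
    return non_reference, rows


# Hypothetical data: the supervisor who rated each instructor-session record.
supervisors = ["A", "B", "A", "C", "B"]
columns, rows = dummy_code(supervisors)

for supervisor, row in zip(supervisors, rows):
    print(supervisor, dict(zip(columns, row)))
```

Entering these indicator columns first absorbs average differences between supervisors (e.g., one rater being generally more lenient), which leaves the remaining predictors to explain variation among ratings made by the same supervisor.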

A different approach to using observer reports of performance was taken in a study 
by Amabile, Schatzel, Moneta, and Kramer (2004). These researchers were interested in 
the supportive behaviors of leaders. As opposed to having a single source provide ratings 
for multiple targets, these researchers faced the opposite issue: Each leader was rated by 
a number of subordinates. Each subordinate completed a daily diary describing salient 
events that occurred during the day and rated the supportiveness of his or her leader for 
that day. Leader behaviors were then identified and coded from the event diaries. As 
expected, positive behaviors typically resulted in a positive relation with supportiveness 
ratings, and negative behaviors typically resulted in a negative relation with 
supportiveness. The particular types of behaviors that were most predictive on the positive side 
included recognition and positive monitoring, and on the negative side included poor 
efforts at motivation, inspiration, and problem solving. 

A final example of observer ratings comes from an unpublished dataset we collected 
recently (Beal, Trougakos, Dalal, Sundie, & Weiss, 2010). Again focusing on the regula- 
tion of emotions, we examined a sample of restaurant servers during their shifts over 
several weeks. The servers completed surveys of their experiences four times during each 
shift, and also handed out comment cards for their customers to complete anonymously. 
Customer comment cards that indicated date and time of completion were aligned with 
the windows in between the electronically time-stamped server surveys. The resulting 
dataset has enabled us to answer a variety of questions linking the server’s recent use of 
particular emotion regulation strategies to immediate and subsequent customer percep- 
tions of service performance. 
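
As a rough illustration of how time-stamped comment cards might be aligned with the windows between surveys, consider the following Python sketch. The function name `window_index`, the timestamps, and the rule that a card belongs to the most recent preceding survey are all assumptions made for the example; the actual matching procedure used for this dataset is not described in the text.

```python
from bisect import bisect_right
from datetime import datetime


def window_index(survey_times, card_time):
    """Return the index of the survey window that contains card_time.

    Window i runs from survey i (inclusive) to survey i + 1 (exclusive).
    Cards completed before the first survey or at/after the last one fall
    outside every window, so None is returned for them.
    survey_times must be sorted in ascending order.
    """
    i = bisect_right(survey_times, card_time) - 1
    if i < 0 or i >= len(survey_times) - 1:
        return None
    return i


# Hypothetical shift with four server surveys.
surveys = [
    datetime(2010, 5, 1, 17, 0),
    datetime(2010, 5, 1, 18, 30),
    datetime(2010, 5, 1, 20, 0),
    datetime(2010, 5, 1, 21, 30),
]

early_card = window_index(surveys, datetime(2010, 5, 1, 17, 45))
late_card = window_index(surveys, datetime(2010, 5, 1, 20, 10))
before_shift = window_index(surveys, datetime(2010, 5, 1, 16, 0))
after_last = window_index(surveys, datetime(2010, 5, 1, 22, 0))
```

Once each card carries a window index, customer ratings can be linked either to the survey that opened the window (for immediate effects) or to an earlier survey (for lagged effects).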


OBJECTIVE MEASURES OF PERFORMANCE 


Although it is difficult to identify a truly objective measure of performance, particularly 
if one defines performance in terms of behavior, a few experiential studies have used 
relatively more objective measures. For example, Trougakos and colleagues (2008) video- 
taped their participants at work multiple times over the course of several days, then used 
a behavioral coding scheme to assess performance-relevant aspects of their behavior from 
the videos. Researchers also have used tangible indices of behavior typically tracked by 
the organization. For example, Miner and Glomb (2010) examined length of calls taken 
by call center employees (shorter calls are preferred by management). These researchers 
found a significant negative relation between pleasantness of mood and the duration of 
call times (i.e., suggesting that positive mood led to at least one form of improved per- 
formance). Finally, some researchers have included their own cognitive tasks to measure 
performance as a part of the study. Totterdell, Spelten, Smith, Barton, and Folkard (1995) 
included both a reaction time task and a memory task in an examination of shift work 
schedules of nurses. Nurses completed these tasks, along with other self-report measures, 
every 2 hours during their shifts. Results suggested that reaction time and lapses in atten- 
tion increased in night shift nurses who were coming off long (i.e., 4-day) rest breaks. 


Contextual Performance 


Although task performance has been examined using multiple forms of measurement, 
the evaluation of contextual performance in experiential work psychology is uniformly 
based on self-reports. Also, these studies uniformly examine affective state as a precur- 
sor of these behaviors. Generally supported is the notion that PA encourages contextual 
performance, at least in the form of helping behavior (e.g., Dalal et al., 2009; Ilies, Scott, 
& Judge, 2006; but see Miner & Glomb, 2010, for a failure to find this effect). Further- 
more, several researchers find that NA often results in a negative form of contextual 
performance, referred to as counterproductive work behaviors (CWB; Dalal et al., 2009; 
Yang & Diefendorff, 2009). Finally, researchers also have been interested in the work- 
place events that trigger the affect-contextual performance sequence. One of the primary 
classes of events is the aforementioned fairness of treatment at work. For example, Yang 
and Diefendorff found that mistreatment from supervisors and customers resulted in 
CWB and was mediated by negative emotions. 


How Are We Affected By Work? 


Though some of the research covered earlier (e.g., flow) discusses positive states that 
emerge from working, work often is examined as a potential source of stress. Specifi- 
cally, researchers are interested in identifying work conditions and individual behaviors 
that lead to the experience of stress, and ultimately to influences on worker health and 
well-being. 


Stress and Coping at Work 


One trend in this area of experiential research has been an effort to understand bet- 
ter the stress process at work through the application of Karasek’s model of job strain 
(Karasek & Theorell, 1990) to a within-person perspective. According to this model, job 
strain arises as a function of high job demands, low job control, and low social support; 
however, tests of this theory initially examined static perceptions of one’s job and met 
with mixed success. Daniels and his colleagues, however, have supported and extended 
the model by conceptualizing (and, more important, measuring) these processes as they 
fluctuate within an individual over time (e.g., Daniels & Harris, 2005). In particular, these 
authors have emphasized that job control and social support enable workers to engage in 
problem-focused coping strategies in the face of job demands. What is learned in these 
episodes then leads to improved work outcomes, reduced levels of strain, and increases in 
activated positive affect (Daniels, Boocock, Glover, Hartley, & Holland, 2009). 

Other research generally has supported the demands-control-support (DCS) model 
of job strain but suggests that the model describes processes that are best captured on an 
intraindividual basis (Teuchmann, Totterdell, & Parker, 1999)—though there appear to 
be complexities still worth exploring. For example, Totterdell, Wood, and Wall (2006) 
assessed DCS model variables in a group of self-employed individuals for 26 weeks. 
Results supported the main effects of demands, control, and support on job strain, but 
did not support the standard interactive pattern of the DCS model (i.e., that control and 
support buffer the negative effects of demands); furthermore, there was some suggestion 
that an interactive pattern may have held for some individuals but not others. 

Consistent with the notion that the capacity to take advantage of available resources 
varies across people is recent work on individual differences in coping. In particular, 
Kammeyer-Mueller, Judge, and Scott (2009) found that individual differences in core 
self-evaluations (the overall tendency to view oneself positively) exhibited a main effect 
pattern of job strain; those higher in core self-evaluations simply perceive fewer stres- 
sors and less strain, and engage in fewer negative coping strategies. Emotional stability, 
however, revealed a more complex interactive pattern, whereby those lower in emotional 
stability react more negatively to stressors. Harris and Daniels (2005) have also found 
that individual differences in beliefs (i.e., beliefs that work demands will lead to stress 
and poor performance) moderate the influence of stressors on psychological well-being 
(i.e., stronger belief generates stronger effects). 


Health Outcomes 


Extending the DCS and other job strain models further are studies that link job demands 
to actual employee health outcomes. Some of these studies examine physiological mea- 
sures of functioning as a proxy for stress and offer a reminder that stress has physical 
consequences. For example, Ilies, Dimotakis, and De Pater (2010) found support for 
the interactive effects of job control and organizational support on the relation between 
workload and blood pressure (BP), providing evidence for the DCS model using some- 
thing other than self-reported strain. Other researchers have extended the within-person 
approach to the DCS model by demonstrating links between daily work stressors and 
health-related behaviors such as snacking, smoking, exercise, and alcohol and caffeine 
consumption (Jones, O’Connor, Conner, McMillan, & Ferguson, 2007). Still other 
researchers have linked daily work stressors to specific pain symptoms, such as arthritis 
pain (Fifield et al., 2004) and low back pain (Gonge, Jensen, & Bonde, 2002). Pransky, 
Berndt, Finkelstein, Verma, and Agrawal (2005) found that the severity of headaches 
at work was associated with performance decrements, but perceived performance dec- 
rements were reported as much stronger than actual performance decrements. Finally, 
Scott and Judge (2006) found that insomnia produced a strong pattern of negativity at 
work (i.e., increased NA, decreased PA and daily job satisfaction), and that women exhib- 
ited stronger work effects of insomnia than men. 


Recovery 


The final area of daily work experience research examines how employees recuperate 
from the strains of work. As many workers use their time off work to engage in restorative 
activities, much of this work assesses experiences at work and outside of work. An early 
example of this work was conducted by Totterdell and his colleagues (1995), who exam- 
ined reports of well-being and cognitive functioning (via a reaction time task) in nurses 
working night shifts. Nurses with only a single rest day exhibited decreases in well-being 
and functioning, but recovered fairly consistently with 2 consecutive rest days. In another 
examination of health care professionals, Zohar, Tzischinski, and Epstein (2003) empha- 
sized the importance of goal-disruptive and goal-enhancing events while at work. These 
researchers found a number of interesting patterns of increased and decreased energy as 
a result of individuals experiencing these important events at work. 

Like research on stress and coping, factors leading to job strain figure prominently 
in research on recovery. For example, Rau, Georgiades, Fredrikson, Lemne, and de Faire 
(2001) found effects consistent with the DCS model on physiological recovery at night 
during sleep (e.g., latency to lowest nighttime systolic BP). The primary difference between 
experiential research on job stress and experiential research on recovery, however, is that 
much more attention is given to specific activities and their capacity to combat the fatigu- 
ing effects of job demands. Sonnentag (2001) found that spending leisure time on work- 
related activities resulted in decreases in well-being by the end of the day. In contrast, 
low-effort, social, and physical activities each predicted increases in well-being by day’s 
end. Subsequent studies (Sonnentag, Binnewies, & Mojza, 2008) have found that leisure
time volunteer activities, mastery experiences (i.e., challenging off-job opportunities to
learn and succeed), and psychological detachment from work (i.e., being able to not think
about work while at home) all contribute to improved well-being. Finally,
Trougakos and colleagues (2008) categorized brief break-time activities in terms of their
requirements for self-regulation, and found that engaging in break activities that were 
high in regulatory requirements resulted in a reduced ability to regulate emotions upon 
returning to work. 


Conclusion 


As this literature review suggests, an examination of daily experiences at work has been 
an enormously useful approach within I/O psychology. The methods of this approach 
have been applied to most major areas of the field, resulting in new insights and greater 
understanding. Part of this insight comes from simply treating within-person variance
as meaningful rather than as error associated with individual behavior. A second
advantage uncovered with this approach is that a number of models in I/O psychology are 
process-oriented (the DCS model, many models of justice perceptions, goal setting, etc.), 
yet only through examining variability in experiences over time can we actually assess 
a process. Finally, an experiential approach to work psychology allows a more precise 
modeling of person-by-situation interactions (Klumb et al., 2009). 
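
The idea of treating within-person variance as meaningful can be made concrete with a small numerical sketch. The Python example below is an illustration only: the ratings are made up, and the `partition_variance` helper and its input format are assumptions introduced here, not data or procedures from any study reviewed in this chapter. It splits the total sum of squares of repeated ratings into a between-person part and a within-person part.

```python
from statistics import mean


def partition_variance(ratings):
    """Split the total sum of squares of repeated ratings into
    between-person and within-person components.

    ratings: dict mapping a person ID to a list of that person's
    repeated (e.g., daily) scores. Hypothetical input format.
    """
    all_scores = [s for scores in ratings.values() for s in scores]
    grand_mean = mean(all_scores)

    between_ss = 0.0
    within_ss = 0.0
    for scores in ratings.values():
        person_mean = mean(scores)
        # Between-person part: how far this person's average sits from
        # the grand mean, counted once per observation.
        between_ss += len(scores) * (person_mean - grand_mean) ** 2
        # Within-person part: how far each rating sits from that
        # person's own average.
        within_ss += sum((s - person_mean) ** 2 for s in scores)

    total_ss = between_ss + within_ss
    return {
        "between_ss": between_ss,
        "within_ss": within_ss,
        "within_share": within_ss / total_ss if total_ss else 0.0,
    }


# Made-up daily job satisfaction ratings (1-7 scale) for three employees.
ratings = {
    "A": [5, 6, 4, 7, 5],
    "B": [3, 2, 4, 3, 5],
    "C": [6, 6, 5, 7, 4],
}
result = partition_variance(ratings)
print(result)
```

In these made-up data, roughly half of the total variability lies within persons. A purely between-person analysis would treat that half as error.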

At the beginning of this chapter, I highlighted some of the challenges associated 
with applying the methods discussed to the problems that I/O psychology practitioners 
face. For example, how can this within-person perspective be of use for selection issues, 
which seem to be inherently between persons in nature? One approach to this question 
is to suggest that most issues that seem to be inherently between persons in nature may 
not ultimately be so devoted to a single partition of the variance. For example, the field of 
selection in I/O psychology is concerned with finding the right people for the vast array of 
jobs that exist in our world. If nothing else, daily experience research has demonstrated 
that even the most static of constructs are, in truth, quite variable over time and across 
situations (Fleeson & Noftle, Chapter 29, this volume). If so, then the best model for 
selection may not be a static consideration of individuals; instead, organizations may 
find it useful to think in terms of which individuals would be most useful for a particular 
problem given the nature of the situation and the events that are likely to occur. This 
perspective suggests that thinking in terms of the best person for the job is overly simplis- 
tic for a complex and changing world. Undoubtedly, such a sea change in thinking will 
occur only when the research literature is replete with powerful examples of dynamic 
experiences at work. I hope this review of the literature makes it clear that—though not 
replete—research in I/O psychology is gaining momentum in this direction. 


References 


Amabile, T. M., Schatzel, E. A., Moneta, G. B., & Kramer, S. J. (2004). Leader behaviors and the work 
environment for creativity: Perceived leader support. Leadership Quarterly, 15, 5-32. 

Bakker, A. B., & Xanthopoulou, D. (2009). The crossover of daily work engagement: Test of an actor- 
partner interdependence model. Journal of Applied Psychology, 94, 1562-1571. 

Barsky, A., Kaplan, S. A., & Beal, D. J. (2011). Just feelings?: The role of affect in the formation of 
organizational fairness judgments. Journal of Management, 37, 248-279. 

Beal, D. J., & Ghandour, L. (2011). Stability, change, and the stability of change in daily workplace 
affect. Journal of Organizational Behavior, 32, 526-546. 




Beal, D. J., Trougakos, J. P., Dalal, R. S., Sundie, J. M., & Weiss, H. M. (2010). [A limited resource view 
of emotion regulation and performance in restaurant servers]. Unpublished raw data. 

Beal, D. J., Trougakos, J. P., Weiss, H. M., & Green, S. G. (2006). Episodic processes in emotional 
labor: Perceptions of affective delivery and regulation strategies. Journal of Applied Psychology, 
91, 1053-1065. 

Beal, D. J., & Weiss, H. M. (2003). Methods of ecological momentary assessment in organizational 
research. Organizational Research Methods, 6, 440-464. 

Beal, D. J., Weiss, H. M., Barros, E., & MacDermid, S. M. (2005). An episodic process model of affec- 
tive influences on performance. Journal of Applied Psychology, 90, 1054-1068. 

Bono, J. E., Foldes, H. J., Vinson, G., & Muros, J. P. (2007). Workplace emotions: The role of supervi- 
sion and leadership. Journal of Applied Psychology, 92, 1357-1367. 

Claessens, B. J. C., van Eerde, W., Rutte, C. G., & Roe, R. A. (2010). Things to do today ...: A 
daily diary study on task completion at work. Applied Psychology: An International Review, 59, 
273-295. 

Cohen-Charash, Y., & Mueller, J. S. (2007). Does perceived unfairness exacerbate or mitigate inter- 
personal counterproductive work behaviors related to envy? Journal of Applied Psychology, 92, 
666-680. 

Conway, N., & Briner, R. B. (2002). A daily diary study of affective responses to psychological contract 
breach and exceeded promises. Journal of Organizational Behavior, 23, 287-302. 

Csikszentmihalyi, M., & LeFevre, J. (1989). Optimal experience in work and leisure. Journal of Person- 
ality and Social Psychology, 56, 815-822. 

Dalal, R. S., Lam, H., Weiss, H. M., Welch, E. R., & Hulin, C. L. (2009). A within-person approach to 
work behavior and performance: Concurrent and lagged citizenship-counterproductivity associa- 
tions, and dynamic relationships with affect and overall job performance. Academy of Manage- 
ment Journal, 52, 1051-1066. 

Daniels, K., Boocock, G., Glover, J., Hartley, R., & Holland, J. (2009). An experience sampling study 
of learning, affect, and the demands control support model. Journal of Applied Psychology, 94, 
1003-1017. 

Daniels, K., & Harris, C. (2005). A daily diary study of coping in the context of the job demands- 
control-support model. Journal of Vocational Behavior, 66, 219-237. 

Demerouti, E. (2006). Job characteristics, flow, and performance: The moderating role of conscientious- 
ness. Journal of Occupational Health Psychology, 11, 266-280. 

Fifield, J., McQuillan, J., Armeli, S., Tennen, H., Reisine, S., & Affleck, G. (2004). Chronic strain, daily 
work stress and pain among workers with rheumatoid arthritis: Does job stress make a bad day 
worse? Work and Stress, 18, 275-291. 

Foo, M., Uy, M. A., & Baron, R. A. (2009). How do feelings influence effort?: An empirical study of 
entrepreneurs’ affect and venture effort. Journal of Applied Psychology, 94, 1086-1094. 

Fullagar, C. J., & Kelloway, E. K. (2009). “Flow” at work: An experience sampling approach. Journal 
of Occupational and Organizational Psychology, 82, 595-615. 

Gonge, H., Jensen, L. D., & Bonde, J. P. (2002). Are psychosocial factors associated with low-back pain 
among nursing personnel? Work and Stress, 16, 79-87. 

Harris, C., & Daniels, K. (2005). Daily affect and daily beliefs. Journal of Occupational Health Psy- 
chology, 10, 415-428. 

Harris, C., Daniels, K., & Briner, R. B. (2003). A daily diary study of goals and affective well-being at 
work. Journal of Occupational and Organizational Psychology, 76, 401-410. 

Harris, M. M., & Schaubroeck, J. (1988). A meta-analysis of self-supervisor, self-peer, and peer- 
supervisor ratings. Personnel Psychology, 41, 43-62. 

Heller, D., & Watson, D. (2005). The dynamic spillover of satisfaction between work and marriage: The 
role of time and mood. Journal of Applied Psychology, 90, 1273-1279. 

Hersey, R. B. (1932). Workers’ emotions in shop and home: A study of individual workers from the 
psychological and physiological standpoint. Philadelphia: University of Pennsylvania Press. 

Holleran, S. E., Whitehead, J., Schmader, T., & Mehl, M. R. (2011). Talking shop and shooting the
breeze: A study of workplace conversation and job disengagement among STEM faculty. Social
Psychological and Personality Science, 2, 65-71.

Ilies, R., Dimotakis, N., & De Pater, I. E. (2010). Psychological and physiological reactions to high
workloads: Implications for well-being. Personnel Psychology, 63, 407-436.




Ilies, R., & Judge, T. A. (2002). Understanding the dynamic relationships among personality, mood, 
and job satisfaction: A field experience sampling study. Organizational Behavior and Human 
Decision Processes, 89, 1119-1139. 

Ilies, R., Schwind, K. M., Wagner, D. T., Johnson, M. D., DeRue, D. S., & Ilgen, D. R. (2007). When 
can employees have a family life?: The effects of daily workload and affect on work-family conflict 
and social behaviors at home. Journal of Applied Psychology, 92, 1368-1379. 

Ilies, R., Scott, B. A., & Judge, T. A. (2006). The interactive effects of personal traits and experienced 
states on intraindividual patterns of citizenship behavior. Academy of Management Journal, 49, 
561-575. 

Ilies, R., Wilson, K. S., & Wagner, D. T. (2009). The spillover of daily job satisfaction onto employees’ 
family lives: The facilitating role of work-family integration. Academy of Management Journal, 
52, 87-102. 

Jones, F., & Fletcher, B. (1996). Taking work home: A study of daily fluctuations in work stressors, 
effects on moods and impacts on marital partners. Journal of Occupational and Organizational 
Psychology, 69, 89-106. 

Jones, F., O’Connor, D. B., Conner, M., McMillan, B., & Ferguson, E. (2007). Impact of daily mood, 
work hours, and iso-strain variables on self-reported health behaviors. Journal of Applied Psy- 
chology, 92, 1731-1740. 

Judge, T. A., & Ilies, R. (2004). Affect and job satisfaction: A study of their relationship at work and at 
home. Journal of Applied Psychology, 89, 661-673. 

Judge, T. A., Scott, B. A., & Ilies, R. (2006). Hostility, job attitudes, and workplace deviance: Test of a 
multilevel model. Journal of Applied Psychology, 91, 126-138. 

Judge, T. A., Woolf, E. F., & Hurst, C. (2009). Is emotional labor more difficult for some than for oth- 
ers?: A multilevel, experience-sampling study. Personnel Psychology, 62, 57-88. 

Kammeyer-Mueller, J. D., Judge, T. A., & Scott, B. A. (2009). The role of core self-evaluations in the 
coping process. Journal of Applied Psychology, 94, 177-195. 

Karasek, R. A., & Theorell, T. (1990). Healthy work. New York: Basic Books. 

Klumb, P., Elfering, A., & Herre, C. (2009). Ambulatory assessment in industrial/organizational psy- 
chology. European Psychologist, 14, 120-131. 

König, C. J., van Eerde, W., & Burch, A. (2010). Predictors and consequences of daily goal adaptation: 
A diary study. Journal of Personnel Psychology, 9, 50-56. 

Lavie, N. (2005). Distracted and confused?: Selective attention under load. Trends in Cognitive Sci- 
ences, 9, 75-82. 

Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, 
NJ: Prentice-Hall. 

Loi, R., Yang, J., & Diefendorff, J. M. (2009). Four-factor justice and daily job satisfaction: A multilevel 
investigation. Journal of Applied Psychology, 94, 770-781. 

Macey, W. H., & Schneider, B. (2008). The meaning of employee engagement. Industrial and Organizational
Psychology: Perspectives on Science and Practice, 1, 3-30.

Miner, A. G., & Glomb, T. M. (2010). State mood, task performance, and behavior at work: A within-persons
approach. Organizational Behavior and Human Decision Processes, 112, 43-57.

Miner, A. G., Glomb, T. M., & Hulin, C. (2005). Experience sampling mood and its correlates at work.
Journal of Occupational and Organizational Psychology, 78, 171-193.

Mojza, E. J., & Sonnentag, S. (2010). Does volunteer work during leisure time buffer negative effects
of job stressors?: A diary study. European Journal of Work and Organizational Psychology, 19,
231-252.

Motowidlo, S. J. (2003). Job performance. In W. C. Borman, D. R. Ilgen, & R. J. Klimoski (Eds.),
Handbook of psychology: Industrial and organizational psychology (pp. 39-54). New York:
Wiley.

Motowidlo, S. J., Borman, W. C., & Schmit, M. J. (1997). A theory of individual differences in task and 
contextual performance. Human Performance, 10, 71-83. 

Ohly, S., & Fritz, C. (2010). Work characteristics, challenge appraisal, creativity, and proactive behav- 
ior: A multi-level study. Journal of Organizational Behavior, 31, 543-565. 

Pransky, G. S., Berndt, E., Finkelstein, S. N., Verma, S., & Agrawal, A. (2005). Performance decrements 
resulting from illness in the workplace: The effect of headaches. Journal of Occupational and 
Environmental Medicine, 47, 34-40. 




Rau, R., Georgiades, A., Fredrikson, M., Lemne, C., & de Faire, U. (2001). Psychosocial work charac- 
teristics and perceived control in relation to cardiovascular rewind at night. Journal of Occupa- 
tional Health Psychology, 6, 171-181. 

Rothbard, N. P. (2001). Enriching or depleting?: The dynamics of engagement in work and family roles. 
Administrative Science Quarterly, 46, 655-684. 

Scott, B. A., & Judge, T. A. (2006). Insomnia, emotions, and job satisfaction: A multilevel study. Jour- 
nal of Management, 32, 622-645. 

Sellen, A. J. (1994). Detection of everyday errors. Applied Psychology: An International Review, 43, 
475-498. 

Snir, R., & Zohar, D. (2008). Workaholism as discretionary time investment at work: An experience- 
sampling study. Applied Psychology: An International Review, 57, 109-127. 

Society for Industrial and Organizational Psychology. (2006). Member survey. Retrieved July 12, 2010, 
from www.siop.org/userfiles/image/2006membersurvey. 

Song, Z., Foo, M., & Uy, M. A. (2008). Mood spillover and crossover among dual-earner couples: A cell 
phone event sampling study. Journal of Applied Psychology, 93, 443-452. 

Sonnentag, S. (2001). Work, recovery activities, and individual well-being: A diary study. Journal of 
Occupational Health Psychology, 6, 196-210. 

Sonnentag, S. (2003). Recovery, work engagement, and proactive behavior: A new look at the interface 
between nonwork and work. Journal of Applied Psychology, 88, 518-528. 

Sonnentag, S., Binnewies, C., & Mojza, E. J. (2008). “Did you have a nice evening?”: A day-level study 
on recovery experiences, sleep, and affect. Journal of Applied Psychology, 93, 674-684. 

Spencer, S., & Rupp, D. E. (2009). Angry, guilty, and conflicted: Injustice toward coworkers heightens 
emotional labor through cognitive and emotional mechanisms. Journal of Applied Psychology, 
94, 429-444. 

Terborg, J. R. (1976). The motivational components of goal setting. Journal of Applied Psychology, 61, 
613-621. 

Teuchmann, K., Totterdell, P., & Parker, S. K. (1999). Rushed, unhappy, and drained: An experi- 
ence sampling study of relations between time pressure, perceived control, mood, and emotional 
exhaustion in a group of accountants. Journal of Occupational Health Psychology, 4, 37-54. 

Totterdell, P. (2000). Catching moods and hitting runs: Mood linkage and subjective performance in 
professional sport teams. Journal of Applied Psychology, 85, 848-859. 

Totterdell, P., Spelten, E., Smith, L., Barton, J., & Folkard, S. (1995). Recovery from work shifts: How 
long does it take? Journal of Applied Psychology, 80, 43-57. 

Totterdell, P., Wall, T., Holman, D., Diamond, H., & Epitropaki, O. (2004). Affect networks: A struc- 
tural analysis of the relationship between work ties and job-related affect. Journal of Applied 
Psychology, 89, 854-867. 

Totterdell, P., Wood, S., & Wall, T. (2005). An intra-individual test of the demands-control model: 
A weekly diary study of psychological strain in portfolio workers. Journal of Occupational and 
Organizational Psychology, 79, 63-84. 

Trougakos, J. P., Beal, D. J., Green, S. G., & Weiss, H. M. (2008). Making the break count: An episodic 
examination of recovery activities, emotional experiences, and positive affective displays. Acad- 
emy of Management Journal, 51, 131-146. 

van Gelderen, B., Heuven, E., van Veldhoven, M., Zeelenberg, M., & Croon, M. (2007). Psychological 
strain and emotional labor among police-officers: A diary study. Journal of Vocational Behavior, 
71, 446-459. 

Weiss, H. M., & Beal, D. J. (2005). Reflections on affective events theory. In N. M. Ashkanasy, W. J. 
Zerbe, & C. E. Hartel (Eds.), Research on emotions in organizations: The effect of affect in orga- 
nizational settings (pp. 1-21). Oxford, UK: Elsevier. 

Weiss, H. M., & Cropanzano, R. (1996). Affective events theory: A theoretical discussion of the struc- 
ture, causes, and consequences of affective experiences at work. Research in Organizational 
Behavior, 18, 1-74. 

Weiss, H. M., Nicholas, J. P., & Daus, C. S. (1999). An examination of the joint effects of affective 
experiences and job beliefs on job satisfaction and variations in affective experiences over time. 
Organizational Behavior and Human Decision Processes, 78, 1-24. 

Weiss, H. M., & Rupp, D. E. (2011). Experiencing work: An essay on a person-centric work psychology. 
Industrial and Organizational Psychology: Perspectives on Science and Practice, 4, 83-97. 




Williams, K. J., Suls, J., Alliger, G. M., Learner, S. M., & Wan, C. K. (1991). Multiple role juggling and 
daily mood states in working mothers: An experience sampling study. Journal of Applied Psychol- 
ogy, 76, 664-674. 

Xanthopoulou, D., Bakker, A. B., Heuven, E., Demerouti, E., & Schaufeli, W. B. (2008). Working in the
sky: A diary study on work engagement among flight attendants. Journal of Occupational Health 
Psychology, 13, 345-356. 

Yang, J., & Diefendorff, J. M. (2009). The relations of daily counterproductive workplace behavior 
with emotions, situational antecedents, and personality moderators: A diary study in Hong Kong. 
Personnel Psychology, 62, 259-295. 

Zohar, D., Tzischinski, O., & Epstein, R. (2003). Effects of energy availability on immediate and 
delayed emotional reactions to work events. Journal of Applied Psychology, 88, 1082-1093. 


CHAPTER 35 


Clinical Psychology 


TIMOTHY J. TRULL 
ULRICH W. EBNER-PRIEMER 
WHITNEY C. BROWN 
RACHEL L. TOMKO 
EMILY M. SCHEIDERER 


Contemporary clinical psychology emphasizes empirically supported approaches to
the assessment, prevention, and treatment of conditions that lead to human suffer- 
ing (Trull, 2007). This wide-ranging definition reflects the broad array of activities that 
characterize clinical psychologists; they may serve as assessment specialists, consultants, 
researchers, or clinicians, to name just a few of the possibilities. In this chapter, we high- 
light the application of experience sampling methods (ESM; Csikszentmihalyi & Larson, 
1987) and ecological momentary assessment methods (EMA; Stone & Shiffman, 1994) 
to the field of clinical psychology.¹ Specifically, we discuss how these methods can shed
light on the nature of mental illness and its symptoms, monitor treatment progress and 
outcome, and assist in the delivery of treatment in daily life. Finally, we conclude with a 
discussion of future applications and challenges in using these methods. 

As mentioned throughout this handbook, ESM/EMA approaches are characterized 
by (1) collection of data in real-world environments; (2) assessments that focus on indi- 
viduals’ current or very recent states or behaviors; (3) assessments that may be event- 
based, time-based, or randomly prompted (depending on the research question); and (4) 
completion of multiple assessments over time (Stone & Shiffman, 1994). ESM/EMA can 
be conducted using a wide variety of media, including paper diaries, electronic diaries, 
or telephones. 
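
As a rough illustration of a randomly prompted, time-based design, the Python sketch below generates one day's prompt times. It assumes a stratified scheme in which the day is divided into equal blocks and one prompt is drawn at random within each block. This scheme, the `semirandom_schedule` function, and the 9:00 A.M. to 9:00 P.M. window are assumptions made for illustration, not a description of any particular study's protocol.

```python
import random
from datetime import datetime, timedelta


def semirandom_schedule(day_start, day_end, n_prompts, rng=None):
    """Return n_prompts prompt times, one drawn uniformly at random
    from each of n_prompts equal-length blocks of the day."""
    rng = rng or random.Random()
    block = (day_end - day_start) / n_prompts  # length of one block
    times = []
    for i in range(n_prompts):
        block_start = day_start + i * block
        # Random offset within the current block, in seconds.
        offset = rng.uniform(0, block.total_seconds())
        times.append(block_start + timedelta(seconds=offset))
    return times


# Hypothetical sampling window: 9:00 A.M. to 9:00 P.M., six prompts.
start = datetime(2024, 1, 15, 9, 0)
end = datetime(2024, 1, 15, 21, 0)
schedule = semirandom_schedule(start, end, 6, rng=random.Random(0))
for t in schedule:
    print(t.strftime("%H:%M"))
```

Drawing one prompt per block spreads the prompts across the day while leaving their exact timing unpredictable to the participant. An event-based design would instead record a report whenever the participant initiates one.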

ESM/EMA data can help to address many of the goals of clinical psychology. Spe- 
cifically, these data can provide a detailed account and understanding of an individual’s 
problems as experienced in daily life. In turn, this information can both inform and 
enhance clinical treatment. In addition, this method can be quite valuable in the moni- 
toring of treatment progress. Patients and clinicians alike can use the feedback from this 
form of assessment to gauge how treatment is progressing and whether modifications are 
necessary. 

This chapter provides an overview of ESM/EMA approaches to characterizing and
describing clinical syndromes and associated features, to monitoring treatment progress 
and outcome, and to providing interactive treatment interventions in real or near-real 
time. Space constraints prohibit us from presenting a comprehensive review of all pub- 
lished studies that use ESM/EMA methods in the field of clinical psychology. In addition 
to our discussion, we point readers to several recent reviews of applications of ESM/EMA 
methods to clinical assessment and treatment (Alpers, 2009; Ebner-Priemer & Trull, 
2009; Oorschot, Kwapil, Delespaul, & Myin-Germeys, 2009; Piasecki, Hufford, Solhan, 
& Trull, 2007; Shiffman, 2009; Trull & Ebner-Priemer, 2009). 

Therefore, we present recent examples of these applications, then discuss future 
developments of ambulatory assessment in clinical psychology. 


Description of Psychopathology and Associated Features 


Perhaps the most obvious application of ESM/EMA methods in clinical psychology is to 
characterize clinical symptoms and features of psychopathology. ESM/EMA methods 
can be used to document and track problematic behaviors (e.g., substance use), mood 
states (e.g., depression, hostility), or cognitions (e.g., cognitive biases, urges/cravings) in 
the daily lives of patients. Using this assessment approach, one can better characterize the 
ebb and flow of behaviors, moods, and cognitions as they occur in daily life. 


Problematic Behavior 


Many forms of psychopathology have behavioral manifestations that are central to the 
disorder and targeted for treatment. For example, ESM/EMA methods have been used 
to track the use of alcohol, drugs, and cigarettes for many years (Shiffman, 2009). Accu- 
rate assessments of addictive behaviors as they occur in daily life are crucial. Retrospec- 
tive accounts of the frequency of substance use may be inaccurate and subject to biases, 
whether conscious or otherwise (Hufford, 2007; Shiffman, 2009). Substance use is a 
discrete but episodic behavior, most amenable to event-based sampling (i.e., complet- 
ing reports when the behavior occurs). Shiffman (2009) presents an overview of ESM/ 
EMA studies that target various addictive behaviors (e.g., smoking, alcohol, and drug 
use), as well as special issues such as compliance with, reactivity to, and validity of EMA 
reports. 

Likewise, other behaviors that characterize psychological disorders can also be 
assessed using an event-based sampling approach. For example, Smyth et al. (2009) used 
EMA methods to explore binge and purge episodes over a 2-week period in 133 women 
with bulimia nervosa. Using palmtop computers, participants completed three types of 
reports concerning mood, stress, bingeing behavior, and purging behavior (vomiting): 
semirandom reports six times per day when prompted by the palmtop, end-of-the-day 
reports, and event-contingent reports. Results indicated that bingeing episodes were less 
frequent in the mornings but peaked at 1:00 P.M. and between 7:00 and 9:00 P.M. As for 
purging, vomiting events tended to increase from early morning to midday, then peaked 
again between 7:00 and 9:00 P.M. Interestingly, bingeing and purging (to a lesser degree) 
tended to be more likely to occur on weekends (especially Sunday) than weekdays. Overall,
these findings are consistent with the perspective that women are most vulnerable to
bingeing and purging after mealtimes, and when at home and alone. 

The assessment of motor activity has been largely overlooked, despite its relevance 
to a wide range of psychopathology (Bussmann, Ebner-Priemer, & Fahrenberg, 2009; 
Tryon, 2006). Recently, Minassian and colleagues (2010) reported on an innovative 
study that examined motoric activity as a phenotype distinguishing bipolar disorders 
from schizophrenia. Rather than rely on the self-report or observer ratings of patients’ 
motor activity, this study used accelerometers to assess motor activity objectively. The 
authors hypothesized that hospitalized, currently manic inpatients with bipolar disorder 
would show higher motor activity when placed in a novel environment than healthy con- 
trols and acutely ill inpatients with schizophrenia. The novel environment was the human 
Behavioral Pattern Monitor, a room outfitted with 11 small, tactile, manipulable objects. 
Each participant was brought into the room for 15 minutes, and his or her motor activity 
was measured by the LifeShirt™, an upper-body garment. 

In addition to cardiac activity measured through electrocardiographic (ECG) leads 
connected to the LifeShirt and placed on the sternum, two dual-axis accelerometers were 
also placed in the fabric of the LifeShirt near the abdomen. Thus, motor activity was 
measured in multiple regions of the body. 

As predicted, patients with bipolar disorder showed the highest motoric activity over 
15 minutes, although their activity level slowed somewhat in the last 5 minutes of the 
study period. The authors suggest that these patients either habituated to the environ- 
ment or experienced some fatigue after intense activity. Interestingly, the patients with 
schizophrenia did not differ significantly from healthy controls in motoric activity. Only 
modest correlations were found between increased motoric activity and elevated mood 
in the patients with bipolar disorder, and no significant effects of medication on motoric 
activity were observed. Overall, this study highlights the promise of using activity mea- 
surement to assess cardinal features of some forms of psychopathology, and also shows 
that self-report of elevated mood state is only modestly correlated with objective mea- 
sures of whole-body activity levels in patients with bipolar disorder. 


Problematic Mood 


Many forms of psychopathology are characterized by elevated or extreme mood states 
(e.g., depression, anxiety, hostility, mania) as well as by extreme fluctuations in mood 
state (Ebner-Priemer & Trull, 2009). For example, major depressive disorder and dys- 
thymia involve high levels of depressive affect, and anxiety disorders involve elevations 
in fear, distress, and panic. Bipolar disorder and borderline personality disorder both 
involve fluctuations or instability of mood state. In the case of borderline personality 
disorder, instability in negative affect (depression, anxiety, anger) typically occurs within 
the day and perhaps several times per day (Trull et al., 2008). So, when assessing mood 
instability in borderline personality disorder, one must sample mood multiple times per 
day in order to capture the dynamic nature of these moods. In the case of bipolar disor- 
der, the dominant mood state (depression or mania/irritability) typically lasts for at least 
several days to weeks. Therefore, daily assessment of mood state in the case of bipolar 
disorder may be sufficient. 
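
The reason sampling frequency matters for instability can be shown with a short sketch. The Python example below uses made-up ratings and one possible instability index, the mean squared successive difference (MSSD). The chapter does not prescribe this index, so its use here is an assumption, as is the `mssd` helper. For simplicity, the overnight change between days is treated like any other successive difference.

```python
from statistics import mean


def mssd(series):
    """Mean squared successive difference of an ordered series of ratings."""
    if len(series) < 2:
        raise ValueError("need at least two ratings")
    return mean((b - a) ** 2 for a, b in zip(series, series[1:]))


# Made-up momentary negative-affect ratings (0-10 scale),
# four per day for three days.
days = [
    [2, 8, 3, 7],
    [8, 2, 7, 3],
    [3, 7, 2, 8],
]
momentary = [r for day in days for r in day]  # all ratings in time order
daily_means = [mean(day) for day in days]     # one summary value per day

momentary_mssd = mssd(momentary)
daily_mssd = mssd(daily_means)
print(momentary_mssd, daily_mssd)
```

In this example every day has the same average rating, so the index computed from daily summaries is zero even though the ratings swing widely within each day. Only the momentary series captures that within-day instability.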

Mood state and instability of mood have been assessed in many ESM/EMA studies 
(Ebner-Priemer & Trull, 2009; Nica & Links, 2009). For example, Vranceanu, Gallo, 
and Bogart (2009) used EMA to examine associations among depressive symptoms,
social experiences, and momentary affect. In their study, 108 women with varying levels
of depressive symptoms (assessed by questionnaire) carried handheld computers for 2 days,
completing diary items that assessed social (conflictive vs. supportive)
and affective (negative vs. positive) experiences at random times throughout the day.
Results revealed that higher levels of depressive symptoms were related to higher EMA-assessed
NA and lower PA, independent of social conflict assessed in daily life. Socially
conflictive experiences, but not socially supportive experiences, were significantly
related to depressive symptoms. However, neither PA nor NA in daily life was associated
with social experiences. In general, these results are in line with the prior findings
of the salience of interpersonal conflict, as well as the role of affect in the etiology and 
maintenance of depression symptomatology. Vranceanu and colleagues concluded that 
these results support the promotion of interventions that attempt to decrease socially 
conflictive experiences via cognitive-behavioral skills training. Furthermore, the findings 
support the targeting of PA and NA to help prevent and/or lessen depressive episodes in 
vulnerable individuals. 

Sakamoto and colleagues (2008) examined the relations between panic attacks and 
locomotor activity in 16 patients with panic disorder and agoraphobia. For 2 weeks, patients
wore a watch-like electronic diary outfitted with a computer screen and joystick.
Wrist activity monitors were also built into the watch-like electronic devices. Patients were
instructed to initiate a report on the electronic device each time they experienced a panic 
attack. As part of the report, participants were surveyed about various subjective physi- 
cal symptoms related to panic attack (e.g., “fear of going crazy”). These symptoms were 
rated on a visual analogue scale ranging from 0 to 100. Patient ratings of these symp- 
toms were used to “validate” that a panic attack had indeed occurred. Sakamoto and 
colleagues note that this method provides a more precise way to measure and characterize
panic disorder than simply having patients indicate (yes/no) whether a panic attack
occurred. There was a positive association between number of panic attacks and mean 
activity level during the 2-week recording period. There was no clear, robust pattern as 
to whether increased activity level preceded or followed panic attacks. 


Problematic Cognition/Expectancies/Urges 


Cognitions, expectancies, and urges/cravings can also be targeted in ESM/EMA studies. 
For example, urges to drink can be tracked in those with alcohol use disorders, signaling 
vulnerability to relapse or bingeing. Similarly, expectancies and motivations for drug use 
can provide insight into the impetus for using illegal substances. Furthermore, cognitive 
biases associated with depression can be assessed along with mood state, providing a 
functional analysis of a depressed patient’s increases in depressed mood. 

Recently, Preston and colleagues (2009) used EMA to test whether craving leads to 
both reported and urinalysis-verified cocaine use. Participants were 112 polydrug users 
(heroin and cocaine), all of whom were currently in methadone maintenance therapy. 
Using an electronic diary, participants responded to random prompts assessing mood 
and craving five times per day and also completed event-contingent monitoring of heroin 
and/or cocaine use, as well as craving of either drug (without use). Results revealed a 
significant positive relationship between cocaine craving and cocaine use during the day. 
Interestingly, craving increased linearly in the 5 hours preceding cocaine use. In addition, 
ratings of cocaine craving were higher during weeks of urinalysis-verified cocaine use
than verified cocaine abstinence. However, these relationships did not hold for heroin 
craving. Last, during the 5 hours preceding cocaine use, heroin craving and mood failed 
to show any systematic changes, suggesting some specificity for cocaine craving in rela- 
tion to cocaine use. 
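
To make the logic of such a lead-up analysis concrete, the following is a minimal sketch and not the analysis reported by Preston and colleagues. It assumes hypothetical time-stamped craving ratings, and the helper names `hourly_lead_up` and `ols_slope` are our own. Ratings are averaged in 1-hour bins before a reported use event; a positive least-squares slope across the bins would correspond to craving that rises as use approaches.

```python
import math
from collections import defaultdict
from statistics import mean


def hourly_lead_up(ratings, use_time, window_hours=5):
    """Average craving ratings in 1-hour bins before a use event.

    ratings  -- list of (time_in_hours, craving) tuples (hypothetical data)
    use_time -- time of the reported use event, in the same hour units
    Returns {hours_before_use: mean craving}; bin 1 is the final hour.
    """
    bins = defaultdict(list)
    for t, craving in ratings:
        lag = use_time - t
        if 0 < lag <= window_hours:
            bins[math.ceil(lag)].append(craving)
    return {b: mean(v) for b, v in sorted(bins.items())}


def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Toy ratings (invented for illustration), with a use event at hour 20.
ratings = [(15.5, 1), (16.5, 2), (17.5, 3), (18.5, 4), (19.5, 5)]
profile = hourly_lead_up(ratings, use_time=20.0)
# Negate the bin labels so that time runs forward toward the use event.
slope = ols_slope([-h for h in profile], list(profile.values()))
```

In practice, ratings from many participants and events would be pooled, and a multilevel model would be a more appropriate choice than a single regression line.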

In this same polydrug-dependent sample, Epstein, Marrone, Heishman, Schmittner, 
and Preston (2010) investigated how cigarette smoking relates to other drug cravings 
and use. Participants were 106 methadone-maintained cocaine- and heroin-using out- 
patients who received five random prompts per day for 5 weeks, then two per day for 
20 weeks. Smoking frequency was positively associated with random-prompt ratings of 
tobacco craving, cocaine craving, and “dual” craving (craving of both cocaine and heroin 
at the same time). Frequency of smoking was not significantly different during episodes 
of cocaine use versus cocaine craving, and smoking frequency was not significantly dif- 
ferent during episodes of dual use versus dual craving. Epstein et al. concluded that their 
findings confirm the strong association between smoking and cocaine use, indicating 
that smoking and tobacco craving are lower during cocaine abstinence than during use. 
This finding is consistent with the model in which tobacco and cocaine each increase the 
craving for and use of themselves and each other. One clinical implication is to provide 
treatment for tobacco dependence during cocaine abstinence. 

Using EMA methods, Nock, Prinstein, and Sterba (2009) set out to determine the 
frequency of, context of, and motivation for self-harm behaviors in adolescents and young 
adults, all of whom had engaged in self-harm over the previous year. For 14 days, par- 
ticipants responded to two prompts on a personal digital assistant (PDA) at both midday 
and end of the day. Furthermore, participants were to complete event-contingent prompts 
whenever they experienced self-destructive thoughts or behaviors. The 30 participants, 
on average, endorsed five self-injurious thoughts per week and reported engaging in 1.6 
self-harm behaviors per week during the study. The authors noted several interesting 
findings. Suicidal ideation, as opposed to self-harm ideation, tended to be longer in dura- 
tion. In addition, suicidal thoughts tended to overlap with other destructive behavioral 
ideation (e.g., drug use), whereas self-harm thoughts overlapped mainly with suicidal 
ideation only. Finally, the odds of engaging in self-harm behaviors increased after partici- 
pants felt rejected, angry toward another or the self, numb, worthless, or self-hating. 

As these studies of psychopathology demonstrate, many findings from ESM/EMA 
studies could be obtained only through these methods and not through traditional ques- 
tionnaire and interview approaches. Individuals are notoriously inaccurate in their retro- 
spective reports of the timing, frequency, and context of symptoms of psychopathology. 
Real-time assessment attenuates the amount of retrospection necessary, and individuals 
report on their own behaviors, emotions, and cognitions as they occur in daily life. 


Monitoring Response to Treatment and Interventions 


In addition to describing and characterizing psychopathology and its correlates, ESM/ 
EMA methods can be used to monitor treatment progress. Traditionally, treatment prog- 
ress is assessed using weekly or monthly questionnaires, or interviews that require retro- 
spection on the part of the patient. However, these retrospective self-report measures are 
notoriously context-dependent (see Schwarz, Chapter 2, this volume) and highly
influenced by momentarily accessible information (Fredrickson, 2000; Gorin & Stone, 2001;
Kahneman, Fredrickson, Schreiber, & Redelmeier, 1993; Kihlstrom, Eich, Sandbrand, 
& Tobias, 2000; Shiffman, Stone, & Hufford, 2008; Solhan, Trull, Jahng, & Wood, 
2009). 

In an early study using paper diaries, Barge-Schaapveld and Nicolson (2002) assessed 
effects of tricyclic antidepressant treatment on the quality and related aspects of daily life 
experience in patients with major depression. Patients were randomly assigned to either 
an imipramine or placebo condition. Patients were prompted by a wristwatch alarm 10 
times per day to complete an ESM report that assessed the effects of depression and 
antidepressant treatment on well-being, mood states, physical complaints, pleasure from 
activities, and time use (activities). Although mean levels of momentary quality-of-life 
measures were not significantly different between the groups, the variability was sig- 
nificantly less in the imipramine group. For remitted patients in the imipramine group 
at the end of the study, mean momentary quality-of-life scores were significantly higher 
than those for the control group. Several aspects of quality of life improved in the imip- 
ramine group over the course of treatment. Finally, the authors reported that only a small 
percentage of patients who showed an increase in specific physical complaints in their 
ESM responses ended up reporting these as side effects to their general practitioner. For 
example, increased dizziness was reported by 35 patients using ESM, whereas only seven 
patients reported increased dizziness to their general practitioner. Importantly, ESM- 
reported side effects of medication were associated with decrements in quality of life, and 
patients who showed strong associations between side effects and decrements in quality 
of life were overrepresented among subsequent treatment dropouts. 
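
The distinction drawn above between mean level and variability can be illustrated with a short sketch. It is not the authors' analysis; the function names and the toy ratings are hypothetical. Each person's momentary ratings are summarized by a within-person mean and standard deviation, and these are then averaged within a group, so two groups can share the same mean level while differing in how much their ratings fluctuate.

```python
from statistics import mean, stdev


def person_summary(ratings_by_person):
    """Return {person_id: (mean, sd)} for each person's momentary ratings."""
    return {pid: (mean(r), stdev(r)) for pid, r in ratings_by_person.items()}


def group_summary(ratings_by_person):
    """Average the within-person means and SDs across a group."""
    per_person = list(person_summary(ratings_by_person).values())
    return (mean(m for m, _ in per_person), mean(s for _, s in per_person))


# Invented ratings: same average level, different moment-to-moment spread.
stable_group = {"a": [4, 4, 4, 4], "b": [5, 5, 5, 5]}
variable_group = {"c": [2, 6, 2, 6], "d": [3, 7, 3, 7]}
stable_mean, stable_sd = group_summary(stable_group)
variable_mean, variable_sd = group_summary(variable_group)
```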

Treatment studies using real-time data capture, such as the early studies by Barge- 
Schaapveld and Nicolson (2002), Bauer and colleagues (2005), and Klosko, Barlow, 
Tassinari, and Cerny (1990), are still relatively infrequent (Marks, Cavanagh, & Gega, 
2007). However, it does appear that over the last few years more clinical researchers are 
using these methods to document treatment progress and outcome. In fact, we anticipate 
that more and more studies will use real-time assessment of symptoms in intervention 
and treatment trials, given the U.S. Food and Drug Administration’s (FDA; 2006) recom- 
mendation to do so. Here, we provide a few recent examples. 

Munsch et al. (2009) collected EMA data on 28 overweight females, with a body mass
index (BMI) between 27 and 40, who met criteria for binge-eating disorder (BED).
Participants were randomly assigned either to short-term cognitive-behavioral treatment
(CBT) for BED or to a waitlist, after which they also received short-term CBT. Both
groups of patients carried an electronic diary for 1 week at both pretreatment and
posttreatment. In addition, the waitlist group carried the electronic diary for 1
week approximately 2 months before their treatment start date. Participants completed 
reports five times daily (during waking hours, but oversampling later in the day, when 
most binges occur), as well as at event-based prompts (when binge eating occurred). EMA 
reports indicated that CBT for BED was effective in significantly reducing the number 
of weekly binges from pretreatment to posttreatment. Although only modestly related to 
EMA reports of number of binges at pretreatment, retrospective self-reports of weekly 
binges also indicated treatment effectiveness. 
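
A signal-contingent schedule that oversamples later hours, as described above, could be generated along the following lines. This is a sketch under stated assumptions: the waking window (8:00 to 22:00), the linear weighting toward the evening, and the 30-minute minimum gap are illustrative choices of ours, not the parameters used by Munsch et al.

```python
import random


def sample_prompt_times(n_prompts=5, wake=8.0, sleep=22.0,
                        late_weight=3.0, min_gap=0.5, rng=None):
    """Draw prompt times (decimal hours) with later hours oversampled.

    Candidates are drawn by rejection sampling from a density that rises
    linearly from 1 at `wake` to `late_weight` (>= 1) at `sleep`. A candidate
    closer than `min_gap` hours to an accepted prompt is discarded.
    All default values are illustrative assumptions.
    """
    rng = rng or random.Random()
    times = []
    while len(times) < n_prompts:
        t = rng.uniform(wake, sleep)
        frac = (t - wake) / (sleep - wake)
        density = 1.0 + (late_weight - 1.0) * frac
        if rng.uniform(0.0, late_weight) > density:
            continue  # rejected: early times are accepted less often
        if all(abs(t - other) >= min_gap for other in times):
            times.append(t)
    return sorted(times)


# One day's schedule; passing a seeded generator makes it reproducible.
todays_prompts = sample_prompt_times(rng=random.Random(2009))
```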

Gehricke, Hong, Whalen, Steinhoff, and Wigal (2009) tested the hypothesis that 
those with attention-deficit/hyperactivity disorder (ADHD) use nicotine to alleviate either 
their inattentive symptoms or high levels of NA associated with comorbid
psychopathology (i.e., a self-medication hypothesis). If true, this might lead to treatment approaches
involving nicotine replacement therapies for ADHD. Fifty-two unmedicated adults with 
ADHD who were either smokers or nonsmokers were given smoking cessation patches. 
Each participant completed two separate 2-day “treatment” sessions, the first with a 
nicotine patch and a second with a placebo patch. These 2-day sessions were between 1 
and 6 weeks apart and conducted on the same day of the week each time. During these 
2-day sessions, participants were prompted to complete an electronic diary assessment 
twice per hour during waking hours. Participants also wore actigraphs and ambulatory 
blood pressure devices in order to monitor cardiovascular and general physical activ- 
ity. Results generally supported the self-medication hypothesis. ADHD symptoms were 
reduced significantly in response to the nicotine patch. In addition, results indicated sig- 
nificant decreases in anger, stress, and nervousness with nicotine patch administration. 
These findings are especially interesting given that there was no difference in these effects 
between smokers and nonsmokers. Although nicotine does have side effects that must be 
considered (e.g., increased heart rate), these results suggest that, at the very least, clini- 
cians should consider the “positive” benefits of nicotine on ADHD symptoms and mood 
when prescribing smoking cessation in this population. 

Studies have also used EMA methods to collect data from knowledgeable informants 
and observers to assess changes due to treatment. For example, Whalen et al. (2010) 
sought to compare the functioning of children (8- to 12-year-olds) with ADHD on sta- 
ble regimens of either traditional stimulant or atomoxetine (ATMX) pharmacotherapy. 
ATMX, an alternative medication for ADHD, is believed to have more continuous, long- 
lasting benefits throughout the day (and with a lower abuse potential). In addition, the 
authors sought to compare the children’s functioning in the mornings to their functioning 
in the evenings. 

The electronic diary component of the study involved interval-contingent reporting. 
Over the course of a week, the children’s mothers used the electronic diaries to rate their 
own moods and perceptions, as well as their children’s moods. Also, the mothers were 
to estimate the frequency of certain discrete child behaviors, such as needing reminders. 
Each mother selected a time in the morning, after she separated from her child, and in 
the evening, after the child had gone to bed. Also, every 30 minutes, children and their 
mothers independently rated their activities, social setting, and moods during nonschool 
hours. 

Despite their stable medication regimen, the children in both groups continued to 
exhibit clinically significant ADHD symptoms. This is consistent with other research 
showing that pharmacotherapy does not completely “normalize” the behavior of children 
with ADHD. Concerning morning and evening differences, results indicated that moth- 
ers of children in the traditional stimulant (vs. the ATMX) group reported lower levels 
of perceived parenting efficacy and satisfaction, and higher levels of negative affect in 
the morning. Children in the ATMX group were rated better on measures of inattention 
and hyperactivity in the morning (vs. the stimulant group), suggesting that ATMX may 
have more long-lasting effects from the previous day. However, there were no differences 
between these medication groups in the evenings. The authors note the importance of 
these findings, given the tendency for mothers/parents of children with ADHD to report 
a great deal of difficulty with their children in the mornings, likely related to greater time 
constraints and number of necessary tasks to accomplish in the morning.

One intriguing possibility is that ESM/EMA assessments may be better able to
capture treatment effects than global, retrospective self-reports or observer ratings. 
For example, Volkers and colleagues (2002) evaluated the effects of antidepressants on 
24-hour motoric activity in 52 inpatients with major depression. Motoric activity was 
monitored by wrist-actigraphy during a medication-free period and again following 4 
weeks of treatment. The researchers found that clinical ratings indicating improvement in
psychomotor retardation following treatment with fluvoxamine were not corroborated by
wrist-actigraphy, which showed no change in activity pattern.

Also evaluating treatment for depression, Lenderking et al. (2008) conducted a 
randomized, open-label study to investigate whether an antidepressant response can 
be detected more rapidly with daily assessment than with standard weekly assessment 
approaches. Seventy-eight outpatients with major depressive disorder were followed for 
28 days, while they began a trial of fluoxetine treatment for depression. Patients were 
randomly assigned to a condition involving either once-per-week clinical evaluation or 
daily diary assessment plus once-per-week clinical evaluation. Survival analyses revealed, 
across most outcome measures, that daily diary responses reflected therapeutic effects 
more quickly than did standard weekly clinic assessments. These findings raise the pos- 
sibility that drug and placebo effects might be easier to separate using daily diary assess- 
ment. Furthermore, they suggest that response to depression treatment may be more pre- 
cisely and efficiently captured using ESM/EMA methods. 
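
For readers unfamiliar with survival analysis, the sketch below shows the standard Kaplan-Meier product-limit estimator applied to time until a first detected response. It is written by hand for illustration and does not reproduce the specific models or data of Lenderking et al.; the durations in the example are invented.

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the probability of not yet having responded.

    durations -- day of first detected response, or last day of follow-up
    observed  -- True if a response was detected, False if censored
    Returns a list of (time, survival_probability), one per event time.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival, curve = 1.0, []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Invented example: four patients, one censored at day 8 without a response.
curve = kaplan_meier([5, 8, 8, 12], [True, True, False, True])
```

Comparing two such curves, one per assessment schedule, is what allows a statement that one schedule detects response sooner than the other.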


Real-Time Feedback and Treatment 


Other exciting possibilities afforded by ESM/EMA data include real-time feedback to 
patients while in their natural environment and the triggering of electronically mediated 
interventions. Collectively, these uses have been termed interactive assessment (Ebner- 
Priemer & Trull, 2009; Fahrenberg, 1996; Shiffman, 2007) or, more recently, ecological 
momentary interventions (Heron & Smyth, 2010). Here we emphasize forms of interac- 
tive assessment that involve either (1) individually tailored moment-specific feedback or 
(2) treatment components. In both cases, feedback, guidance, or treatment is provided 
when indicated by some response or behavior logged by the patient. 

Moment-specific feedback, in a sense, blurs the distinction between assessment and 
treatment. Immediate feedback can advise patients about how to cope while experiencing 
symptoms in daily life. In theory, this feedback is superior to that administered in a stan- 
dard treatment setting, because patients in outpatient settings receive advice and direc- 
tion only once a week, in the clinic. With moment-specific feedback, patients with panic 
disorder, for example, can monitor and address (through breathing exercises) respiratory 
dysregulation as it occurs in their natural environment through the use of a portable 
capnometer (Meuret, Wilhelm, Ritz, & Roth, 2008), a device that can monitor breathing 
rate and adequacy of ventilation. 

Although research has not yet established the effectiveness of interactive ambulatory 
assessment over treatment as usual, there are some promising studies to date (Alpers, 
2009; Intille, 2007). For example, electronic devices have been used to deliver treatments 
for a variety of health and psychological conditions (Heron & Smyth, 2010; Marks et al., 
2007), and have also been used to compare treatments for the same condition. 
Kenardy and colleagues (2003) compared a therapist-delivered CBT treatment (12
sessions) with a brief, 6-session therapist-delivered CBT treatment; a computer-augmented,
6-session, therapist-delivered CBT treatment; and a waitlist group. The palmtop in the 
computer-augmented CBT treatment signaled the participant five times daily to practice 
one of the following therapy modules: self-statement, breathing control, situational expo- 
sure, or interoceptive exposure. Patients in all treatment conditions showed improvement 
compared to the waitlist group. The treatment condition with computer augmentation 
demonstrated several stronger effects for clinically significant change compared to the 
brief treatment module alone. This was, however, only the case at posttreatment, not at 
the 6-month follow-up. The authors also compared treatment costs, which were lower
for the brief treatment and the computer-augmented condition than for the standard
treatment. Other examples include using handheld computers or mobile phones to
administer modules related to CBT for depression (e.g., Marks et al., 2003; Osgood-Hynes et
al., 1998), bulimia (e.g., Norton, Wonderlich, Myers, Mitchell, & Crosby, 2003), social 
phobia (e.g., Przeworski & Newman, 2004), and smoking (e.g., Schneider, Schwartz, & 
Fast, 1995). 

It is worth noting that interactive assessment has been used in patient populations 
that are traditionally considered quite difficult to treat. For example, Solzbacher,
Böttger, Memmesheimer, Mussgay, and Rüddel (2007) targeted affective dysregulation in
patients with borderline personality disorder, chronic posttraumatic stress disorder, and 
bulimia nervosa. Patients in this study used cellular phones to report the intensity of 
their emotions and distress four times a day, selected randomly, over 3 weeks. Reports of 
high levels of distress prompted a standard suggestion of skills to use to regulate distress. 
Investigators also programmed a follow-up prompt 30 minutes later to assess the utility 
of the intervention. 
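
The decision rule in a protocol of this kind can be sketched in a few lines. The following is an illustration only: the 0-10 distress scale, the cutoff of 7, and the action labels are hypothetical and are not taken from Solzbacher and colleagues. A report at or above the cutoff triggers an immediate skill suggestion and schedules a follow-up prompt 30 minutes later.

```python
from datetime import datetime, timedelta

DISTRESS_THRESHOLD = 7                    # hypothetical cutoff on a 0-10 scale
FOLLOW_UP_DELAY = timedelta(minutes=30)   # delay before the follow-up prompt


def handle_report(distress, reported_at, threshold=DISTRESS_THRESHOLD):
    """Return the actions triggered by one distress report.

    Each action is a (when, kind) tuple. An empty list means the report
    fell below the threshold and nothing further is scheduled.
    """
    if distress < threshold:
        return []
    return [
        (reported_at, "suggest_skill"),
        (reported_at + FOLLOW_UP_DELAY, "follow_up_prompt"),
    ]


# A high-distress report at 14:00 yields a suggestion now and a prompt at 14:30.
actions = handle_report(8, datetime(2007, 6, 1, 14, 0))
```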

Weitzel, Bernhardt, Usdan, Mays, and Glanz (2007) investigated the effects of indi- 
vidually tailored electronic messages on drinking behavior over a 2-week time period 
in college students. Two groups of participants (all of whom drank more than once per 
week) carried handheld computers and logged daily alcohol consumption. In the instant 
message group, participants received daily, individually tailored instant messages about 
avoiding alcohol consequences. These messages were constructed by the experimenters 
based on each participant’s responses to measures of self-efficacy and alcohol conse- 
quence expectancies completed prior to carrying the handheld computers for 2 weeks. 
A second group of participants, the control group, only logged alcohol experiences on 
a handheld computer and did not receive any instant messages. Controlling for baseline
levels of drinking, the instant message group logged significantly fewer drinks on the
handheld computers over the 2 weeks than did the control group. Furthermore, the
instant message group reported lower expectancies of getting into trouble due to alcohol 
use at follow-up than did the control group. 

More recently, McDoniel, Wolskee, and Shen (2010) evaluated a self-monitoring and 
resting metabolic rate technology (SMART) component of a weight reduction treatment 
program. The SMART technology measured resting metabolic rate using a handheld 
device. Based on this assessment, a nutrition program was created for each participant. 
Participants were also given a handheld computer to monitor their food intake and exer- 
cise. Participants included 111 obese, middle-aged adults with a BMI of 30 or higher, 
who were randomly assigned to nutritional education as usual or the interactive SMART 
program. All participants received motivational interviewing in addition to a 12-week 
weight reduction program. Instead of the handheld computer, individuals in the
treatment as usual condition received a paper-and-pencil diary for monitoring food intake
and exercise. Results indicated that there was no difference in outcome between groups. 
Both groups improved at follow-up (using both completer and intent-to-treat analyses) by 
reducing body weight and arterial pressure, and increasing motivation. 

As a final example, Patrick and colleagues (2009) conducted a randomized con- 
trolled trial for an instant messaging intervention in overweight adults. Sixty-five adults 
were assigned to one of two conditions, each lasting 4 months: (1) a control group that 
received printed materials about weight control; and (2) an intervention group that 
received personalized text messages two to five times daily, printed materials, and brief 
monthly phone calls from a health counselor. Of note, each person in the intervention 
group received informational texts (weekly topics, e.g., controlling food portions) and 
tips, as well as interactive texts that required a response (e.g., “How often do you use a 
meal plan?”). Results indicated that, adjusting for baseline weight, age, and gender, the 
intervention group lost significantly more weight over the 4 months than the control
group.


Conclusions, Prospects, and Challenges 


As we have discussed, ESM/EMA methods can be used in a variety of ways to accom- 
plish many of the goals of clinical psychology. Most notably, these data can provide a 
detailed account and understanding of an individual’s problems as experienced in daily 
life. Patients experience problematic moods, engage in dysfunctional behaviors, and 
experience urges and impulses while living their lives in their natural environment. Clini- 
cal psychologists can get retrospective accounts of these experiences each week, but these 
accounts may not be accurate or precise enough to guide further assessment or interven- 
tion optimally. Moreover, traditional cross-sectional clinical assessment is not able to 
capture the dynamic and time-dependent nature of the clinical phenomena that we seek 
to study (Ebner-Priemer, Eid, Stabenow, Kleindienst, & Trull, 2009). ESM/EMA studies 
of clinical phenomena indicate that moods, behaviors, and urges may wax and wane over 
time, and these changes may be influenced by other events, experiences, and situations, 
to name a few of the possibilities. It is asking too much of both patients and clinicians to 
use retrospective, cross-sectional assessments to piece together the temporal sequencing 
of so many relevant variables. In turn, this traditional approach is likely to result in a 
very simplistic, limited, and perhaps inaccurate account of the problems and concerns of 
our patients. 

Although fewer in number, studies using ESM/EMA methods to evaluate and moni- 
tor treatment progress are promising. Multiple assessments within a day are possible, as 
are assessments across a variety of contexts (e.g., at home, at work, at social events). In 
addition to simple assessments of clinical targets (e.g., moods, behavior), it is also pos- 
sible to query about a variety of contexts, including location, people in the immediate 
vicinity, current activity, expectancies, and so forth. Furthermore, other behaviors and 
experiences during treatment can be monitored, including medication compliance, home- 
work compliance, sleeping and eating behavior, drinking behavior, and health issues. 
This matrix of experiences, behaviors, and events can then be temporally sequenced with 
electronic time stamps. Our brief review of studies that have used ESM/EMA to monitor 
treatment suggests that these methods can uncover treatment effects more quickly and
precisely, provide more ecologically valid information about a patient’s condition and 
ability to cope in the natural environment, and offer an alternative perspective on treat- 
ment response that may or may not agree with the ratings or observations of clinicians. 
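
As a small illustration of the temporal sequencing mentioned above, separately recorded streams can be merged into a single chronological record by their time stamps. The record layout (timestamp, source, value) and the example entries are assumptions made for this sketch.

```python
import heapq
from datetime import datetime


def sequence_streams(*streams):
    """Merge time-stamped streams into one chronologically ordered list.

    Each stream is a list of (timestamp, source, value) tuples that is
    already sorted by timestamp.
    """
    return list(heapq.merge(*streams, key=lambda record: record[0]))


# Invented entries from two sources recorded on the same day.
mood = [(datetime(2010, 3, 1, 9, 0), "mood", 4),
        (datetime(2010, 3, 1, 15, 0), "mood", 2)]
medication = [(datetime(2010, 3, 1, 8, 30), "medication", "taken")]
timeline = sequence_streams(mood, medication)
```

Once the streams share one timeline, questions about what preceded a change in mood can be asked directly of the data rather than reconstructed from memory.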

Interactive assessment is an exciting extension, as the feedback and patient responses 
from ESM/EMA can be used to inform and administer treatment electronically. Treat- 
ments with well-defined components or skills modules seem especially well suited for 
interactive assessment. For example, electronic reminders of previously learned skills or 
instructions for coping can be triggered by indications of distress or problematic mood 
states in patients’ electronically delivered responses. It is also possible to tailor interven- 
tions based on each patient’s symptoms, mood, or psychological profile; this information 
can be collected before treatment, and a catalog of personalized electronic interventions 
can be prepared before treatment begins. The patient’s responses to the ESM/EMA sur- 
veys can then trigger the appropriate intervention. 

Despite the promise of these possibilities, a number of issues must be considered 
before incorporating ESM/EMA methods into clinical research or treatment. First, the 
extent to which both clinical psychologists and patients are comfortable with the tech- 
nology will greatly affect the likelihood of adoption and compliance. In addition, in the 
case of wireless technologies, patients (and clinicians) who live in areas not well served by 
wireless networks may experience some difficulty using handheld computers or mobile 
phones. 

Compliance with an ESM/EMA protocol is a function of many variables, including
user-friendliness, acceptance, burden of the protocol, and length of study period,
to name a few. Fortunately, studies routinely document that individuals are open to the
use of electronic devices and are able to comply with prompts for assessments while in 
their natural environment (Ebner-Priemer & Sawitzki, 2007; Hufford, 2007; Shiffman, 
2009). Many have documented high rates of compliance with ESM/EMA methods in 
clinical studies involving the monitoring of symptoms. For example, a recent multisite 
study of adult patients with schizophrenia or schizoaffective disorder, with substance 
dependence, or with anxiety disorders found that monitoring with electronic diaries was 
well accepted by these patients in terms of participation rates and compliance rates (John- 
son et al., 2009). Furthermore, the authors found little evidence for reactivity or fatigue 
effects across the monitored variables (e.g., mood, environment/social context, activity/ 
behavior). Reports such as this one are encouraging and demonstrate that high rates of 
compliance can be obtained provided that clinical psychologists design studies and treat- 
ment trials that address factors known to lead to poor compliance and build in features 
known to increase compliance (see Hufford, 2007). 
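
Compliance is often summarized as the proportion of scheduled prompts that were answered. The sketch below shows one way to compute it; the 15-minute response window and the rule that each response can satisfy only one prompt are assumptions of ours, and individual studies define compliance differently.

```python
def compliance_rate(scheduled, answered, window_minutes=15):
    """Fraction of scheduled prompts answered within `window_minutes`.

    scheduled -- prompt times in minutes since the start of the study
    answered  -- response times in the same units
    Each response can satisfy at most one prompt.
    """
    remaining = sorted(answered)
    hits = 0
    for prompt in sorted(scheduled):
        for i, response in enumerate(remaining):
            if 0 <= response - prompt <= window_minutes:
                hits += 1
                del remaining[i]  # this response is now used up
                break
    return hits / len(scheduled) if scheduled else 0.0


# Invented example: four prompts, two answered in time, one answered too late.
rate = compliance_rate([60, 240, 480, 720], [65, 250, 800])
```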

We know less about compliance with recommended interventions in the natural envi- 
ronment. When using interactive assessment with individually tailored moment-specific 
feedback, patients not only have to comply with reports on their experience several times 
a day, but they also are asked to use specific and perhaps (to them) novel skills to allevi- 
ate their symptoms. This is more challenging than simply reporting levels of distress or 
problems. Therefore, alternative assessment strategies (e.g., self-recording or self-video 
gathered by a mobile phone) may be helpful to document the ability of patients to follow 
through on the suggested interventions. 

Another issue to consider is whether to use handheld computers or mobile phones 
in ESM/EMA studies. It now may be advantageous to use software that is compatible 
with mobile phones (and discontinue the use of PDAs) given the penetration these have
in the general population. For example, it is estimated that up to 85% of the world’s 
population has access to a mobile signal (Kwok, 2009). Today, mobile phones can be 
used for two-way communication in real time or near real time (text messaging, voice), 
and many phones now have video, WiFi, Bluetooth, and even computing capabilities. In 
addition to the increased capability and distribution of mobile phones, another impor- 
tant reason for developing phone applications for ESM/EMA studies is that individuals 
prefer to use hardware they own and with which they are familiar, instead of having to 
carry an additional electronic device. Although this “switch” sounds straightforward in 
theory, the lack of a universally adopted operating system for mobile phones presents a 
major problem; companies vary as to the operating systems that can be used with their 
own phones (Microsoft Windows Mobile, BlackBerry, Palm OS, Mobile Linux, Android, 
etc.). Because patients are likely to subscribe to a variety of phone services, the clinical 
researcher or clinician is forced to choose a platform and provide equipment and software 
to those whose existing mobile phones are not compatible with this platform. 

In addition to the practical issues of adopting and using this methodology, there 
are a number of questions to be addressed by future research. First, does ESM/EMA 
show added value or incremental validity over existing methods? Studies have indicated 
that data from ESM/EMA methods do not always agree with those collected via cross- 
sectional questionnaire or interview (e.g., Solhan et al., 2009). However, it remains to 
be demonstrated whether ESM/EMA data are more valid or predict treatment response 
better than data from traditional assessment measures (e.g., questionnaire scores). 

Second, is the cost-benefit ratio favorable? Using ESM/EMA methods can be time- 
intensive for patients and costly for clinicians and clinical researchers. Therefore, a dem- 
onstration of savings in time for the clinician, as well as incremental gain in knowledge, 
is necessary before using these methods routinely. 

Third, there are several issues concerning interactive assessment and treatment using 
ESM/EMA. Devices such as PDAs and mobile phones can collect data very efficiently. 
However, it is not clear how best to present or graphically display these data to both 
patients and clinicians to maximize understanding of the clinical problems, to aid appro- 
priate formulation of treatment, and to provide timely feedback on progress to improve 
the likelihood of good outcomes. In addition, some treatments that have well-defined and 
easily summarized components are more amenable to interactive assessment and treat- 
ment. For example, cognitive-behavioral treatments that involve developing and practic- 
ing behavioral, cognitive, and interpersonal skills could be adapted to facilitate real-time 
intervention in the natural environment. Other forms of psychological treatment may 
not be adaptable to this format. More randomized controlled trials are needed to estab- 
lish the efficacy and effectiveness of treatments administered in an interactive, electronic 
format while patients are in their natural environment. To date, few studies have been 
conducted to establish the efficacy of “ecological momentary interventions” in patient 
populations (Heron & Smyth, 2010). 

Finally, there are several ethical and professional considerations. As with all elec- 
tronic devices and communications used to document health status, privacy is a concern. 
Some of these concerns can be allayed by using password-protected devices and proto- 
cols, using data encryption and secure servers to house data, and providing Health Insur- 
ance Portability and Accountability Act (HIPAA) training to all those who have access to 
patient data. Other issues are perhaps more complex, such as how best to disclose data 
collected passively (e.g., through body sensors) if concerns arise, how best to respond to
crises or emergencies that occur in the field (e.g., suicidality), and how to protect data that 
may indicate illegal activity (e.g., underage drinking, use of illicit substances). Regard- 
ing professional concerns, clinical psychology has experienced a long-standing debate 
over the necessity of a doctoral degree to practice independently. It seems reasonable to 
assume that there may be some concern about administering treatment electronically to 
multiple patients using predetermined interventions guided by a decision-making algo- 
rithm. In many ways, this issue is reminiscent of the clinical versus statistical prediction 
debate (man vs. machine) several decades ago. Ultimately, empirical data will (or will not) 
support the use of ESM/EMA methods to assess and treat our patients. 

In summary, real-time assessment and interactive treatment raise a number of excit- 
ing possibilities that are likely to impact the field of clinical psychology. Both electronic 
diaries and mobile phones are now able to transfer information in near-real time. Fur- 
thermore, these devices can serve not only as a platform for patients to answer queries 
but also as a means of collecting audio, video, geographical positioning, and (through 
attachments) some physiological and biological data (e.g., blood alcohol content). Data 
from wearable devices and biosensors are likely to be integrated soon with information 
gathered via electronic devices and mobile phones to provide a comprehensive picture of 
patients’ emotional, psychological, behavioral, and physical functioning in their natural 
environment (see Chapter 15 by Intille, and Chapter 14 by Goodwin, this volume). 

This area, however, is still in its infancy, and several important questions need to 
be answered before these techniques are incorporated into routine practice. Clinicians 
and clinical researchers may be more resistant than patients to the idea of implement- 
ing real-time assessment and intervention methods. Some barriers to implementation are 
quite understandable, including cost of devices, the need for knowledge about software 
and programming, and the challenge of how best to view or analyze the data gathered 
(even for a single case). Although perhaps initially daunting, we do not believe that these 
challenges are insurmountable. As more clinicians and clinical researchers integrate these 
methods into their work, more guidance and resources (e.g., freeware, how-to manuals) 
will likely become available. 


Note 


1. It is important to note that the terms experience sampling method (ESM), ecological momentary 
assessment (EMA), and ambulatory assessment are often used interchangeably. However, there are 
distinctions between these methods. EMA has its roots in self-monitoring approaches and in ESM 
in particular; however, it is viewed as a broader methodology that attempts to integrate a number of 
assessment traditions with similar goals (Shiffman, Stone, & Hufford, 2008). Furthermore, techno- 
logical advances have expanded the assessment targets of EMA beyond self-reported subjective states 
and behavior to the sampling and monitoring of physiological processes (e.g., heart rate, respiration; 
often referred to as ambulatory assessment) and of behaviors or states recorded or “observed” by 
electronic devices (e.g., pill taking, audio recordings, video recordings). Here, we use the designation 
ESM/EMA as an umbrella term to refer to all of these “daily life” sampling approaches that obtain 
multiple measures over time and emphasize real-time or near-real-time assessment. 
CHAPTER 36


Psychiatry 


INEZ MYIN-GERMEYS 


The essence of psychiatric symptoms is that they are natural experiences emerging in
the realm of normal daily life, often in interaction with contextual features. This central
characteristic, however, is often overlooked. Despite increasingly sophisticated methods
to investigate brain functions, genetic mechanisms, or neuropsychological performance, 
we know disappointingly little about the expression of psychopathological symptoms in 
daily life and the dynamic interplay between the person and his or her environment in 
psychiatry. 

Fortunately, there is a growing awareness that the study of psychiatric patients in 
the context of normal daily life may provide a powerful and necessary addition to more 
conventional and cross-sectional research strategies. A growing body of studies is using 
techniques such as experience sampling methods (ESM) or ecological momentary assess- 
ment (EMA) to study psychopathology in real life. This chapter consists of practical 
guidelines for ESM relevant to studying psychiatric patients, as well as an overview of 
topics in psychiatry for which daily life studies may be particularly relevant. 


Practical Issues Related to Daily Life Research in Psychiatry 


Questionnaire Development 


Apart from the general rules in questionnaire development for daily life studies, such 
as using momentary items rather than trait-like items, using self-exploratory items with 
short cues, and using items that reflect upon commonplace situations (see Mehl & Rob- 
bins, Chapter 10, and Goodwin, Chapter 14, this volume), a few additional issues need 
to be kept in mind when developing an ESM questionnaire for use in psychiatric patients 
(Palmier-Claus et al., 2011). 

First, the language used in the questions should reflect how people describe their own 
behavior and experiences. One should avoid the psychological lexicon that is unfamiliar 
to laypeople. Questions about constructs such as attributions, coping, or dissociation are 
often disliked, and reliability of the responses is questionable. 

Second, not all psychiatric constructs of interest are easy to capture. People are not 
always aware of their symptoms, yet we do rely on self-report in our diary studies. Therefore,
one needs to select items that include aspects of mental states of which subjects are
aware and that have been directly associated with the experiences in which one is inter- 
ested. For example, delusional ideation, which is associated with poor insight, is hard to 
assess using self-report. However, items such as “Preoccupation with your thoughts,” 
“Feeling suspicious,” or “Feeling controlled” include aspects of mental states that are 
directly associated with delusional ideation in psychosis and about which people can self- 
report (Myin-Germeys, Delespaul, & van Os, 2005). 

Third, a crucial issue within psychiatry is reducing reactivity to the method (see 
Barta, Tennen, & Litt, Chapter 6, this volume). The frequent assessment of mood (e.g., 
“When you feel depressed”) or symptoms may influence the person’s experience and 
indeed exacerbate the symptom intensity. It is therefore crucial to construct the diary 
carefully to reduce this reactivity to the method. A good strategy is to intersperse spe- 
cific with more general items, so that individuals are not solely provided with emotion- 
ally salient information (Palmier-Claus et al., 2011). For example, assessing not only 
affect and symptoms but also contextual information (e.g., “Where am I?”, “What am I 
doing?”, “Whom am I with?”) ensures that the participant is not focusing solely on his/ 
her own disorder. 

Fourth, researchers should avoid reflective statements, such as “In this social context, 
I feel down.” Much more information can be gained when mood and context questions 
are asked independently and the association is made statistically by the researcher after- 
wards. This not only sheds light on patterns of which participants are not consciously 
aware but also limits the risk of participants giving socially desirable answers. 


Validity, Feasibility, and Compliance 


A major issue when using ESM to study psychopathology is whether patients with severe 
mental disorders are able to provide reliable and valid self-reports of their current cog- 
nitions, affect, behavior and environmental context. The complexity of the ESM may 
result in a selection bias, with only better-performing patients able to participate in ESM 
studies. 

First, ESM studies have been successfully conducted in a variety of patients with less 
and more severe psychopathology. Patients with depression (Peeters, Nicolson, Berkhof, 
Delespaul, & De Vries, 2003; Wichers, Schrijvers, et al., 2009), eating disorders (Engel et 
al., 2005), panic disorder (Dijkman-Caes, De Vries, Kraan, & Volovics, 1993), attention- 
deficit/hyperactivity disorder (ADHD; Knouse et al., 2008), and borderline personality 
disorder (Glaser, van Os, Mengelers, & Myin-Germeys, 2008; Stiglmayr et al., 2008) have
all been studied with ESM (see Myin-Germeys et al., 2009, for a review). ESM has been 
very extensively used in schizophrenia research (Oorschot, Kwapil, Delespaul, & Myin- 
Germeys, 2009) and in studying acute paranoid patients (Thewissen, Bentall, Lecompte, 
van Os, & Myin-Germeys, 2008) and heavy cannabis users (Henquet et al., 2010), as 
well as high-risk subjects and patients with psychosis in remission (Myin-Germeys, van 
Os, Schwartz, Stone, & Delespaul, 2001). Overall, acceptable numbers of valid reports 
were found across all these groups. Although more ill populations may show higher drop- 
out rates, acceptable numbers can be achieved (Thewissen et al., 2008). 

Second, the reduction of recall biases, due to the prospective and real-time assessments
of mood and psychopathology, may be particularly relevant in psychiatric populations.
The decreased cognitive capacities of some psychiatric patients may even increase
their proneness to recall bias. Furthermore, it is much easier to reflect upon the moment than to
conduct a global assessment over a period of time.

Third, it can be argued that the capacity for self-report is diminished in patients with 
severe mental disorders such as autism or psychosis, because they lack insight (Debowska, 
Grzywa, & Kucharska Pietura, 1998). ESM studies, however, have found meaningful 
variation in a number of domains, including mood and symptoms, even in patients with 
severe psychiatric problems. 

Compliance with the research protocol is another important issue in momentary 
assessment research. Compliance in these studies involves several aspects, such as com- 
pleting an adequate number of assessments, completing assessments at the time signaled, 
and accurately completing the protocols. Compliance in patients with psychosis was spe- 
cifically investigated in one ESM study using palmtop devices. Almost 90% of partici- 
pants with schizophrenia completed on average over two-thirds of the electronic ques- 
tionnaires. This compliance rate was only somewhat lower than that in previous studies 
using computerized ESM in nonpsychiatric (90-96%) and higher-functioning psychiatric 
samples (86-92%) (Granholm, Loh, & Swendsen, 2008). These rates indicate that people 
with severe mental disorders such as schizophrenia are willing and able to participate in 
ESM studies. Compliance can further be enhanced by careful design and implementa- 
tion. 
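
As a rough illustration of how compliance figures of this kind can be derived, the following Python sketch computes each participant's completion rate and the share of participants who reach a given threshold. The data layout (a mapping from participant ID to one boolean per signal), the function names, and the toy values are assumptions made for this example; they do not reproduce the procedure or data of Granholm et al. (2008).

```python
def completion_rates(responses):
    """Return the fraction of signaled questionnaires each participant completed.

    `responses` maps a participant ID to a list of booleans, one per beep
    (True = questionnaire completed). This layout is hypothetical.
    """
    return {pid: sum(beeps) / len(beeps) for pid, beeps in responses.items() if beeps}


def share_meeting_threshold(rates, threshold=2 / 3):
    """Proportion of participants whose completion rate is at least `threshold`."""
    if not rates:
        raise ValueError("no participants with signaled beeps")
    return sum(rate >= threshold for rate in rates.values()) / len(rates)


# Toy data: three participants, six beeps each (illustrative only).
toy = {
    "p01": [True, True, True, True, False, True],
    "p02": [True, False, True, True, True, True],
    "p03": [True, False, False, True, False, False],
}
rates = completion_rates(toy)
print(rates)
print(share_meeting_threshold(rates))
```

In practice one would also check whether each report was completed within an acceptable window after the signal, which this sketch leaves out.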


Phenomenology 


Psychiatric phenomenology has not been fully described or understood. Knowledge 
about essential characteristics of psychiatric phenomena, such as the frequency of their 
occurrence, variability over time, and interplay with the context in which they emerge, 
is lacking to a great extent. Similarly, the ecological validity of some of the psychiatric 
constructs, as we use them today, is not clear. ESM technology may be instrumental in 
improving our understanding of psychiatric phenomena. 


Testing the Ecological Validity of Psychiatric Constructs 


Diagnostic classification systems describe psychiatric disorders and their characterizing 
symptoms. However, the behavioral counterparts or consequences of these symptoms are 
not always clear. ESM studies could improve our understanding of these symptoms, as 
well as test their ecological validity. Do they unfold in real life as one would expect? Panic 
disorders, for example, may occur with or without agoraphobia. An ESM study found 
that panic subjects with agoraphobia did spend significantly more time at home and with 
their families than did panic patients without agoraphobia and normal controls. How- 
ever, they found no difference in the amount of time these two patient groups spent in 
public places, thus challenging diagnostic conceptualizations of panic disorder subtypes 
based on retrospective reports (Dijkman-Caes et al., 1993). 

ESM may also be instrumental in improving our understanding of the true nature 
of these psychiatric symptoms. Anhedonia, an incapacity to enjoy pleasure, for exam- 
ple, has been considered a core negative symptom of psychosis. However, prospective, 
repeated assessments in daily life may unravel the different components of this construct. 
One study showed that anhedonia was more related to a deficit in anticipatory pleasure
(pleasure related to future events) than to consummatory pleasure (pleasure when
directly engaging in an activity) (Gard, Kring, Gard, Horan, & Green, 2007). A recent 
study revealed that patients with schizophrenia were able to enjoy pleasant events as 
much as controls. However, the patients did not experience as many pleasant events as 
controls (Oorschot et al., 2013). In the same study, Oorschot and colleagues showed that 
patients with negative symptoms may deviate from controls in the way they experience 
the social environment, suggestive of social anhedonia. The concept of social anhedonia 
was validated in daily life in a general population sample. Social anhedonia is charac- 
terized by increased time spent alone, a greater preference for solitude, and less positive 
affect (Brown, Silvia, Myin-Germeys, & Kwapil, 2007); this is in contrast with social 
anxiety, which was associated with increased preference to be alone when interacting 
with unfamiliar people but not with increased time alone. A similar approach was used 
to investigate social needs in adult patients with pervasive developmental disorder or 
autism. These patients showed no sign of social disinterest, as is seen in social anhedonia. 
Rather, they reported social anxiety, with increased levels of anxiety and negative affect 
when in the company of strangers, but they were able to enjoy the company of relatives 
or friends (Hintzen, Delespaul, van Os, & Myin-Germeys, 2010). ESM data may provide 
better insight into the true nature of psychiatric phenomena as they unfold in daily life. 


Frequency of Occurrence 


People are often poor at judging retrospectively the frequency and intensity of painful, 
overwhelming, or emotionally salient experiences. Yet we do rely almost entirely on ret- 
rospective assessments to define frequency of symptoms. The prospective design of daily 
life studies may provide a much more reliable assessment strategy to understand the frequency
of occurrence both at the level of the individual person (e.g., in clinical practice)
and at the group level for research purposes. An example can again be found in a study of panic
disorder. De Beurs, Lange, and Van Dyck (1992) demonstrated that self-monitored inci- 
dence of panic attacks was lower than incidence rates obtained through retrospective 
estimation. Similarly, in patients diagnosed with obsessive-compulsive disorder (OCD), 
retrospective recall as opposed to EMA led to overestimation of symptom covariation 
with nonsymptomatic variables such as stress and loneliness (Gloster et al., 2008). 

Frequency of psychotic symptoms has also been studied with ESM. Contrary to what 
was expected based on psychiatric interviews and retrospective recall, visual hallucina- 
tions were reported most frequently whereas auditory hallucinations were experienced 
with a higher intensity (Delespaul, De Vries, & van Os, 2002). Delusional moments, on 
the other hand, which are expected to be quite stable over time, were found to be present 
only one-third of the time and were accompanied by high negative and low positive affect 
(Myin-Germeys, Nicolson, & Delespaul, 2001). 


Dynamics and Variability over Time 


The pathological nature of experiences may not necessarily be found in the content of 
the experience itself, but may rather be reflected in the duration, intensity, or variability 
of the experience. Feeling down at a particular moment in time may be a normal and 
healthy experience. However, when these feelings of depression persist over time, or when
they are interchanged with moments of extreme happiness, this may indicate a psycho- 
pathological process. The longitudinal nature of daily life studies may shed light on these 
dynamic patterns over time. 

First, ESM can be used to study diurnal patterns of experiences over the day. This 
has been done in the study of mood disorders, since little is known about how depression 
alters the dynamics of mood states in daily life. Peeters, Berkhof, Delespaul, Rotten- 
berg, and Nicolson (2006) found that negative affect shows a more pronounced diurnal 
rhythm in depressed patients compared to controls. They also reported that depressive 
persons exhibit increasing positive affect levels over the day, with a later acrophase (i.e., 
peak in the circadian fluctuation) relative to healthy controls. 

Second, dynamic and reactive mood changes over time are included as defining char- 
acteristics of psychiatric disorders such as bipolar disorder or borderline personality dis- 
order, yet very little is known about the extent of this variability and how it occurs in real 
life. Borderline personality disorder has been extensively studied (see also Ebner-Priemer 
& Trull, Chapter 23, this volume) with ESM. Patients were found to report increased 
levels of intraindividual variability and short-term fluctuations in overall affect valence 
(Ebner-Priemer et al., 2007; Trull & Ebner-Priemer, 2009). Only a few studies have been 
done in patients with bipolar disorder, but these studies also confirmed strong fluctua- 
tions both in affect and in self-esteem (Knowles et al., 2007). 

Interestingly, similar variability was found in psychiatric conditions that are con- 
sidered more stable over time. Peeters and colleagues (2006) found more variability in 
moment-to-moment negative mood in depressed patients compared to controls. Depres- 
sive symptoms and problem behavior have also been related to intense and labile emo- 
tions, and less effective regulation of these emotions in adolescents (Silk, Steinberg, & 
Morris, 2003). Psychotic patients were characterized by a higher variability in mood 
(Myin-Germeys, Delespaul, & De Vries, 2000; Oorschot et al., 2013) and self-esteem 
(Thewissen et al., 2008) compared to controls. Interestingly, positive psychotic symp- 
toms, such as delusions and hallucinations, were found to be highly variable in intensity 
over time (Delespaul et al., 2002; Myin-Germeys, Nicolson, et al., 2001). 

Overall, these data show that one of the most central characteristics of almost all 
symptoms observed in patients with severe psychiatric disorders is meaningful and wide- 
spread variation over time. It is important to understand that cross-sectional instruments 
are unable to assess this variation. As a consequence, data obtained with these instru- 
ments may not be linearly related to ESM findings. 
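
To make the distinction between overall dispersion and moment-to-moment instability concrete, the sketch below contrasts the within-person standard deviation with the mean squared successive difference (MSSD), an index often used for short-term fluctuation. The two toy series and the function name are assumptions for illustration only. The sketch also treats all observations as one unbroken sequence, whereas real ESM data would normally exclude differences that span overnight gaps or missed beeps.

```python
from statistics import mean, pstdev


def mssd(series):
    """Mean squared successive difference between consecutive observations."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return mean([(later - earlier) ** 2 for earlier, later in zip(series, series[1:])])


# Two hypothetical mood series with identical means and standard deviations.
one_shift = [1, 1, 1, 7, 7, 7]    # a single change in level
oscillating = [1, 7, 1, 7, 1, 7]  # a swing at every beep

for label, series in (("one_shift", one_shift), ("oscillating", oscillating)):
    print(label, pstdev(series), mssd(series))
```

Both series have the same standard deviation, yet only the second shows the beep-to-beep swings that the MSSD is sensitive to. A single summary score would not separate them.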


Aetiology: Determinants of Variability 
in Psychiatric Symptoms 


In order to improve our understanding of psychiatric disorders, as well as the treat- 
ment of psychiatric symptoms, it is crucially important to investigate both the internal 
and situational determinants of the observed variation in symptomatology over time. 
Momentary assessment strategies may constitute a powerful tool to examine the under- 
lying mechanisms related to the onset and maintenance of psychopathology, especially 
when these mechanisms involve changes with regard to how persons react to or behave in 
certain situations or environments. 
Internal Determinants


ESM consists of prospective, longitudinal data that provide the opportunity to examine
temporal dynamics between the relevant experiences, possibly giving clues as to the direc- 
tion of causality. With time-lagged approaches, we can unravel how internal determi- 
nants, such as within-person changes in mood or self-esteem, may lead to the onset of 
symptoms, as well as investigate the consequences of these symptoms over time. 
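
The data preparation behind a time-lagged analysis can be sketched as follows: each outcome at beep t is paired with the predictor at beep t - 1, but only within the same person and day and only when the two beeps are consecutive. The record layout, field names, and toy values are assumptions for this example. Published analyses typically enter such pairs into multilevel regression models, which are not shown here.

```python
def lagged_pairs(records, predictor, outcome):
    """Pair the predictor at beep t-1 with the outcome at beep t.

    `records` is a list of dicts with keys 'person', 'day', 'beep' plus the
    measured variables (hypothetical field names). Pairs are formed only
    within the same person and day, and only for consecutive beeps, so that
    overnight gaps and missed beeps do not produce spurious lags.
    """
    ordered = sorted(records, key=lambda r: (r["person"], r["day"], r["beep"]))
    pairs = []
    for prev, curr in zip(ordered, ordered[1:]):
        same_day = (prev["person"], prev["day"]) == (curr["person"], curr["day"])
        consecutive = curr["beep"] - prev["beep"] == 1
        if same_day and consecutive:
            pairs.append((prev[predictor], curr[outcome]))
    return pairs


# Toy records for one participant (illustrative values only).
records = [
    {"person": "a", "day": 1, "beep": 1, "self_esteem": 5, "paranoia": 1},
    {"person": "a", "day": 1, "beep": 2, "self_esteem": 3, "paranoia": 2},
    {"person": "a", "day": 1, "beep": 3, "self_esteem": 4, "paranoia": 4},
    {"person": "a", "day": 2, "beep": 1, "self_esteem": 6, "paranoia": 2},
    {"person": "a", "day": 2, "beep": 3, "self_esteem": 2, "paranoia": 1},  # beep 2 missed
]
print(lagged_pairs(records, "self_esteem", "paranoia"))
```

Swapping the `predictor` and `outcome` arguments gives the pairs for the reverse direction, which is one simple way to compare the two temporal orderings.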

The temporal association between mood and self-esteem has been investigated in 
relation to paranoid ideation. Several proposed psychological models of paranoia have
assumed a causal role of anxiety and low self-esteem in the onset of paranoia (Garety,
Bebbington, Fowler, Freeman, & Kuipers, 2007). However, all of these were based on 
cross-sectional studies. ESM studies may be helpful in unraveling the directionality of 
relationships between mood and symptoms. A recent study by Ben-Zeev, Ellington, 
Swendsen, and Granholm (in press) found anxiety and sadness to be predictive of sub- 
sequent persecutory delusions, with anxiety related to both delusional belief conviction 
and associated distress, while sadness was predictive only of distress. In a similar time- 
lagged approach, Thewissen and colleagues (2008) found decreases in self-esteem to be 
predictive of subsequent paranoia. In a second study, Thewissen and colleagues (2011) 
used a more sophisticated approach. They investigated episodes of paranoia, defined as 
consecutive moments of increased paranoia intensity. They found that increased anxiety 
and decreased self-esteem predicted subsequent onset of a paranoid episode, with sadness 
and anger also increased during the episode. They also found that level of intensity and 
associated depression at the first moment of an episode predicted longer duration of the 
episode (Thewissen et al., 2011). 
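
An episode-based approach of this kind requires a rule for identifying runs of consecutive elevated moments. The sketch below shows one simple way to do so. The cutoff, the minimum run length, and the toy scores are assumptions for illustration and are not the criteria used by Thewissen and colleagues (2011).

```python
def find_episodes(scores, threshold, min_length=2):
    """Return (start, end) index pairs for runs of consecutive scores above `threshold`.

    A run counts as an episode only if it spans at least `min_length` beeps.
    Both parameters are hypothetical choices for this example.
    """
    episodes = []
    start = None
    for i, score in enumerate(scores):
        if score > threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_length:
                episodes.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(scores) - start >= min_length:
        episodes.append((start, len(scores) - 1))
    return episodes


# Toy paranoia ratings on a 1-7 scale for one participant (illustrative only).
paranoia = [1, 2, 5, 6, 5, 2, 6, 1, 5, 5]
print(find_episodes(paranoia, threshold=4))
```

Once episodes are delimited, the beeps just before each start index can be examined for antecedents, and the run length serves as a measure of duration.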

A similar sophisticated approach was used by Wichers and colleagues, who inves- 
tigated temporal dynamics between positive and negative affect in depression (Wichers, 
Lothmann, et al., 2012) as well as the effect of physical activity on positive affect (Wich- 
ers, Peeters, Rutten, et al., in press). First, they showed that a more favorable course of 
depression was associated with a better baseline suppression of negative affect following 
increases in positive affect over the course of the day (Wichers, Lothmann, et al., 2012). 
In a second study, Wichers, Peeters, Rutten, et al. (in press) used time-lagged analyses 
to show that positive affect increased at the moments following an increase in physical 
activity. This increase remained for up to three beeps (in their study, up to 4.5 hours). 
Interestingly, depressed patients gained much less from being involved in physical activ- 
ity. Their level of positive affect was increased for only 1.5 hours. 

Overall, these data show that ESM may be highly instrumental in understanding
the dynamic temporal processes that underlie the development of psychopathological 
symptoms. This may not only fuel new research in the area of psychological models but 
also be essential in improving psychological therapies. However, it is clear that we need 
more sophisticated statistical approaches to grasp the complex interplay between several 
dynamic variables and to unravel the temporal associations, possibly giving clues as to 
the direction of causality. 


Contextual Determinants 


Variability over time may not be triggered only by internal changes; some of the mecha- 
nisms underlying psychopathology may be found in the dynamic interplay between the 
environment and the person. The longitudinal character of the data, and the assessment
of both experiences and environmental features, allow investigation of subtle person-environment
interplays of which subjects are not always consciously aware.


SOCIAL CONTEXT 


The social context is relevant in many psychiatric disorders. Family hassles have been 
related to binges in people with eating disorders (Okon, Greene, & Smith, 2003). Bulimic 
individuals also showed larger increases in self-criticism following negative social inter- 
actions, suggesting a hypersensitivity to social interactions in bulimia (Steiger, Gauvin, 
Jabalpurwala, Seguin, & Stotland, 1999). The social environment has also been associ- 
ated with psychotic symptoms. The intensity of hallucinatory experiences was found to 
be influenced by several contextual moderators, including social context and activity level 
(Delespaul et al., 2002). The importance of the dynamics of social context in the onset of 
psychotic experiences was emphasized by another ESM study (Verdoux, Gindre, Sorbara, 
Tournier, & Swendsen, 2003), which showed that dynamic changes (e.g., transition from 
the presence of familiar to nonfamiliar individuals) in social company were stronger pre- 
dictors of psychotic experiences in persons at risk for psychosis than social contact per 
se. In subjects with remitted psychosis, delusional moments were influenced by the social 
company, with lower levels of delusions in the presence of family and increased intensity 
in the presence of strangers (Myin-Germeys, Nicolson, et al., 2001). In a recent study, 
these findings were further refined, showing that for people with low or moderate levels 
of trait paranoia, the social context indeed was relevant, with increased levels of paranoid 
ideation when in the presence of less-well-known people compared to being with family 
or friends (Collip, Oorschot, et al., 2011). For people with high levels of trait paranoia, 
however, the context became irrelevant. High levels of paranoia were reported irrespec- 
tive of the person’s social company. These data show that psychosis is driven by subtle 
person-environment interactions in the stream of daily life. Knowledge of the contextual 
features that influence symptoms may be helpful in assisting patients to develop better 
coping strategies and in creating therapeutic interventions. 


STRESS REACTIVITY AND REWARD RESPONSIVITY 


Stress has been postulated to play a role in all major psychiatric illnesses. Whereas pre- 
vious studies focused on either the stressor or the vulnerability, current views stress the 
importance of the interplay between the occurrence of an event, and the appraisals and 
emotional reactions in the receiver. ESM may be ideally suited to study the interplay 
between subjective appraisal of naturally occurring events, and the emotional and symp- 
tomatic reactions to these events. Similarly, reward responsivity has been defined as the 
ability to generate positive affect from positive events happening in daily life. 

The occurrence of stress and increased stress sensitivity has been particularly pro- 
posed in the pathogenesis of affective disorders. An ESM study in patients with current 
major depression examined the occurrence of both positive and negative daily hassles 
in the context of daily life. Patients with major depression experienced fewer positive 
events but the same amount of negative events compared to controls, although both types 
of events were rated as more stressful (Peeters et al., 2003). Increased stress sensitivity, 
defined as increased negative affect in reaction to stressful daily events, was associated 
with genetic risk for depression and predicted new, future depressive symptomatology
in a general population twin sample (Wichers, Geschwind, et al., 2009; Wichers et al., 
2007b). However, increased reward experiences protected against negative affective 
symptoms at follow-up (Wichers et al., 2010). 
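
Stress sensitivity in this sense can be thought of as a within-person slope: how much negative affect rises per unit increase in the appraised stressfulness of events. The sketch below estimates such a slope by ordinary least squares for each participant separately. The variable names and toy ratings are assumptions, and the studies cited here generally relied on multilevel models rather than per-person regressions.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x for one participant's momentary reports."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least two")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    if ss_x == 0:
        raise ValueError("predictor does not vary")
    ss_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return ss_xy / ss_x


# Toy data: event stress appraisals and concurrent negative affect (illustrative only).
stress = [1, 2, 3, 4]
negative_affect = {
    "less_reactive": [2, 2, 3, 3],
    "more_reactive": [1, 3, 5, 7],
}
reactivity = {pid: ols_slope(stress, na) for pid, na in negative_affect.items()}
print(reactivity)
```

A steeper slope indicates a larger emotional response to the same increase in appraised stress. The same calculation, with positive affect and pleasantness ratings, gives a simple index of reward experience.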

Sensitivity to daily life stress has also been extensively investigated in relation to psychosis.
In these studies, stress was conceptualized in two different ways: (1) as daily hassles,
similar to what has been done in studies of affective disorder, and (2) as even smaller,
momentary disturbances related to the activity in which someone is involved (e.g., doing 
something that one is not motivated to do or not good at, or that costs effort). It has 
been demonstrated that both daily hassles and smaller, activity-related stress appraisals 
are associated with increased emotional reactions in persons with an increased liability 
for psychosis (e.g., subjects from the general population with higher levels of subclinical 
psychosis, first-degree relatives of patients with psychosis, and patients with psychosis in 
a current state of remission) (Lataster et al., 2009; Myin-Germeys, van Os, et al., 2001). 
These subjects also showed continuous variation in the intensity of subtle psychotic expe- 
riences associated with minor stresses in the flow of daily life (Myin-Germeys et al., 
2005). 

ESM has also been used to understand the mechanism underlying this increased sen- 
sitivity to stress. It has been argued that both environmental and genetic factors (see the 
section on gene-environment interactions) may contribute to the increase in stress sen- 
sitivity. Collip, Myin-Germeys, and van Os (2008) tested the hypothesis that increased 
stress sensitivity may be the result of a sensitization process. Sensitization means that 
individuals who are exposed repeatedly to an environmental risk factor may develop 
progressively greater responses over time, finally resulting in a lasting change in response 
amplitude, or they may need progressively smaller stressors to elicit a similar response. 
ESM allows for the study of increased amplitudes in relation to previous exposure to 
environmental events. Several studies demonstrated that exposure to traumatic events 
in childhood is associated with increased reactivity to daily life stress as assessed with 
ESM. This has been demonstrated in general population studies (Glaser, van Os, Porte- 
gijs, & Myin-Germeys, 2006; Wichers, Schrijvers, et al., 2009) as well as in patients with 
psychosis (Lardinois, Lataster, Mengelers, van Os, & Myin-Germeys, 2011). Similarly, 
in a study of patients with psychosis, it was demonstrated that previous exposure to life 
events modified the emotional reaction to daily life stress (Myin-Germeys, Krabbendam, 
Delespaul, & van Os, 2003). These studies suggest a pattern of sensitization in which the 
effects of early stress may give rise to a lasting behavioral lability. 


CANNABIS USE 


Cannabis use is another contextual factor that has been associated with the onset of 
psychotic disorders. Although cannabis can be administered in experimental settings, in 
controlled conditions with precise dosages of THC, such studies do not provide insight 
into patterns of use or the effects of real-life cannabis use or symptoms as they occur in 
the context of daily life. A few ESM studies have investigated the association between 
real-life cannabis use and psychotic symptoms. Verdoux and colleagues (2003) found 
that in undergraduate students, cannabis immediately increased the intensity of psychosis 
and decreased feelings of pleasure, especially in participants with high vulnerability for 
psychosis. Henquet and colleagues (2010) extended the finding that cannabis increases 
hallucinatory experiences in patients with psychosis. However, they found that patients
are also more sensitive to the mood-enhancing effects of cannabis. They used the pro- 
spective nature of the data to disentangle immediate effects (the beep after cannabis use) 
from subacute effects (one beep later). The data suggested an immediate reward effect of 
cannabis on mood, whereas the effect on hallucinations was subacute, possibly explain- 
ing why patients with psychosis continue to use cannabis (Henquet et al., 2010). Both 
studies found no evidence for the self-medication hypothesis, since cannabis use was not 
predicted by previous mood or intensity of psychosis. These findings provide a better 
insight into the motives for and mechanisms of cannabis use, yielding essential informa- 
tion for the design and improvement of treatment strategies. 


Treatment 


Momentary assessment strategies may also offer clear advantages in clinical psychopharmacology
and treatment studies. First, since
ESM provides more accurate information retrieval and has superior ecological validity 
(Moskowitz & Young, 2006), the effects of treatment could be assessed in a much wider 
range of behavior. For example, Barge-Schaapveld, Nicolson, van der Hoop, and de Vries 
(1995) found that clinical improvement in depression was reflected in increased house- 
hold activities, less passive leisure time, and increased positive and decreased negative 
affect. 

Second, the collection of many data points increases sensitivity to detect change. 
One study showed that daily diaries are more effective in detecting therapeutic action of 
antidepressant medication than standard weekly clinic assessments (Lenderking et al., 
2008). A study of imipramine (a tricyclic antidepressant agent) treatment on momen- 
tary quality of life revealed that the treatment condition stabilized rather than changed 
overall levels of quality of life and decreased the level of inactivity (Barge-Schaapveld & 
Nicolson, 2002). Treatment also decreased sensitivity to stress and, more specifically, 
increased reward experience, defined as increases in positive affect in reaction to posi- 
tively appraised activities (Wichers, Barge-Schaapveld, et al., 2008). Response to treat- 
ment in depression may thus be conditional on restoration of hedonic capacity. At the 
same time, ESM may allow the capture of more subtle side effects at an earlier stage. One 
study compared the mood-decreasing effects of higher dosages of antipsychotics, 
especially those binding tightly to the dopamine receptor (D2). It was found that for tightly 
bound antipsychotics, such as risperidone and haloperidol, increasing levels of estimated 
D2 occupancy were associated with decreased positive affect and increased negative affect 
in real life. For users of loosely bound agents, no such association was apparent (Lataster 
et al., in press). These studies show how ESM may provide more fine-grained informa- 
tion on treatment effects and adherence, and more insight into psychological mecha- 
nisms underlying response to treatment. Despite these clear advantages, few studies have 
applied ESM to investigate treatment effects in psychopathology. 

Third, ESM data could also be used in the context of clinical therapy. ESM technol- 
ogy could provide an ideal tool to unravel patterns of behavior of an individual patient, 
for example, in the context of cognitive-behavioral therapy. ESM has already been applied 
successfully in anxiety and eating disorders (see Heron & Smyth, 2010, for a review). A 
second and more ambitious option would be to apply ESM directly as an instrument for 
therapeutic intervention. Direct feedback at the level of normal daily functioning may be 
a powerful tool to help people regain control over their emotions and experiences (Heron 
& Smyth, 2010; Myin-Germeys, Birchwood, & Kwapil, 2011). 


Gene-Environment Interactions 


There is increasing consensus that neither genes nor environmental factors alone, but 
rather interplay between genes and environment, may sufficiently explain the development 
of complex psychiatric disorders (Caspi & Moffitt, 2006; van Os, Rutten, & Poulton, 
2008). Gene-environment (G x E) interaction means either that susceptibility genes 
amplify an individual’s risk of reacting with psychopathology to environmental 
pathogens or, conversely, that environmental factors moderate the effects of genes on 
psychopathology. Whereas there is an increasing level of sophistication at the level of identifying 
target genes, with genomewide association studies identifying polymorphisms, as well as 
copy number variants (for a review, see van Winkel et al., 2010), a similar optimization 
is needed at the level of the assessment of the environmental exposure. Moffitt, Caspi, and 
Rutter (2005) suggested that environmental risk factors (1) should be proximal rather 
than distal environmental factors, (2) should be assessed prospectively rather than retro- 
spectively, and (3) should be assessed cumulatively and repeatedly over time. Since ESM 
technology fulfills all of these criteria, implementing this methodological improvement 
may be advisable. An additional advantage is that ESM data, with multiple assessments 
per person, provide more statistical power to detect G x E interactions. 

One may use indirect approaches to study G x E interactions, for example, with twin 
studies providing indirect assessments of genetic risk. Jacobs and colleagues (2006), for 
example, used structural equation modeling to demonstrate a small genetic influence on 
the dynamic relationship between minor stress and affective responses in the flow of daily 
life in a general population twin sample. In the same study sample, it was demonstrated 
that probands whose co-twins were diagnosed with lifetime depression showed a stron- 
ger mood bias to real-life stress than probands with co-twins without such a diagnosis, 
thus providing evidence that the genetic liability to depression is in part expressed as the 
tendency to display negative affect in response to minor stressors in daily life (Wichers 
et al., 2007b). Interestingly, however, the experience of positive affect associated with 
stress attenuated this endophenotypic expression of genetic vulnerability for depression 
(Wichers et al., 2007a). 

An alternative approach is to assess direct G x E interactions between genetic 
polymorphisms and environmental factors. Thus far, only a few studies have used ESM 
to study direct G x E interactions in psychopathology. The environmental pathogens 
studied so far are limited to minor stressors in daily life, real-life cannabis use, and 
reward experiences. G-stress interactions have been studied in relation to psychosis, 
depression, and anxiety. One study in patients with psychosis reported an interaction 
between the COMT Val158Met genotype and daily stress on negative mood and 
psychosis. The COMT gene encodes the enzyme catechol-O-methyltransferase, which plays an 
important role in the degradation of dopamine in the brain. Patients with the Met/Met 
genotype (with a low-activity enzyme) showed the largest increase in psychotic symp- 
toms and negative emotions in response to daily stressors (van Winkel et al., 2008). 
This has recently been replicated in a larger sample of patients with psychotic disorders 
(Collip et al., 2011). One study in the general population, on the other hand, found 
opposite results with Val/Val carriers (with the high-activity enzyme) displaying more 
feelings of paranoia in response to stress (Simons et al., 2009). Gene-stress interactions 
have also been investigated in depression, showing that positive affect decreased genetic 
moderation by a polymorphism in the gene encoding brain-derived neurotrophic factor 
(BDNF—a factor known to be associated with stress-related behaviors) of the negative 
affect response to social stress (Wichers, Kenis, et al., 2008). Finally, one study using a 
once-a-day report found increased anxiety levels in reaction to stress in subjects with 
at least one copy of the S (short) or Lg (long g) allele (both associated with less efficient 
transcription of the gene) at 5-HTTLPR, a polymorphism in the promoter region of the 
serotonin transporter gene, compared to those who were La (long a) homozygotes, 
suggestive of G x E interaction (Gunthert et al., 2007). 

Real-life cannabis use has also been studied as an environmental pathogen in G x 
E studies. One study in patients with psychosis demonstrated that the COMT Val carri- 
ers, but not subjects with the Met/Met genotype, reported an increase in hallucinations 
in response to real-life cannabis use (Henquet et al., 2009). Finally, one study examined 
the moderating effects of the COMT Val158Met genotype on reward experience, defined 
as the effects of event appraisals on positive affect. The ability to experience reward 
increased with the number of Met alleles of the subject (Wichers et al., 2008). 
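
Reward experience, as defined above, is the effect of event appraisal on positive affect, so it can be pictured as a within-person slope. The sketch below is a deliberately simplified illustration under stated assumptions: it fits an ordinary least-squares slope per person and averages the slopes by number of Met alleles, whereas multilevel regression is the usual choice for nested ESM data. The function names and the input layout are hypothetical, and no real data are shown.

```python
def within_person_slope(appraisals, positive_affect):
    """Least-squares slope of positive affect on event appraisal for one person.

    Returns None when the slope is undefined (fewer than two beeps,
    or no variation in the appraisal ratings).
    """
    n = len(appraisals)
    if n < 2 or n != len(positive_affect):
        return None
    mean_x = sum(appraisals) / n
    mean_y = sum(positive_affect) / n
    sxx = sum((x - mean_x) ** 2 for x in appraisals)
    if sxx == 0:
        return None
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(appraisals, positive_affect))
    return sxy / sxx


def mean_slope_by_met_alleles(records):
    """Average the per-person slopes within each genotype group.

    `records` maps a person id to a tuple of
    (number of Met alleles, appraisal ratings, positive affect ratings).
    """
    slopes = {}
    for met_alleles, appraisals, affect in records.values():
        slope = within_person_slope(appraisals, affect)
        if slope is not None:
            slopes.setdefault(met_alleles, []).append(slope)
    return {group: sum(vals) / len(vals) for group, vals in slopes.items()}
```

A G x E effect of the kind reported would appear here as group means that rise with the number of Met alleles. A formal test would instead enter the genotype-by-appraisal interaction term in a multilevel model that accounts for beeps being nested within persons.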

These initial studies show the feasibility and power of ESM data for high-quality 
G x E interaction studies examining the immediate effect of risk exposure conditional 
on genetic vulnerability for psychopathology in general, as well as conditional on specific 
susceptibility polymorphisms. 


Conclusion 


This chapter has shown that variability over time and dynamic interplay with the envi- 
ronment are essential features of psychopathological experience that need to be captured 
for a better understanding of its phenomenology and underlying mechanisms. Since vari- 
ability is very difficult to assess with traditional instruments, the findings of ESM stud- 
ies are a fundamental extension to cross-sectional data. ESM has also been shown to 
contribute to a better understanding of the interplay between psychopathological experi- 
ences and environmental features, particularly since most of these interaction patterns 
are subtle and not consciously appraised. 

Together, the data from the ESM literature suggest that the study of the variability 
and the dynamic patterns of reactivity of experiences, as they unfold in daily life, may 
accelerate our understanding of psychological mechanisms underlying psychopathology. 
In addition, it may foster new research into treatment response and prove to be essential 
in the rapidly developing field of G x E interactions. ESM allows capture of the film rather 
than a snapshot of daily life reality in psychopathology. 